
Calibration and evaluation of a DNN model

Created on Aug 09 09:29:22 2022 @author: Javier Muro

This code trains a feed-forward neural network (modelDNN.py and gkfold_DNN.py) to predict either biomass or species richness.

The be_preprocessing.py module preprocesses the data frame.

We can also use a random forest (gkfold_RF.py) instead of the neural network for comparison purposes.

Training and validation points are split with the replicates grouped together, to avoid spatial and temporal dependencies.

Alternatively, the user can perform a spatial cross-validation, selecting two regions for training and the remaining region for validation.

Machine learning terminology found across tutorials and courses:

  1. Validation dataset: part of the training dataset used by the model to fine-tune itself and by you to fine-tune predictors and hyperparameters (e.g. epochs, activation function, number of layers...). Set with validation_split when calling the .fit function (see the sketch after this list).

  2. Test dataset: dataset never seen by the model. Used to evaluate the final model's accuracy once hyperparameters and predictors have been determined.

  3. Labels: 'y', study variable, dependent or response variable

  4. Features: 'x', predictors
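As a minimal illustration of points 1 and 2 (a sketch, not code from this repository; features, labels, and a compiled model are assumed to already exist):

from sklearn.model_selection import train_test_split

# Point 2: hold out a test set the model never sees
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Point 1: Keras carves the validation set out of the training data;
# with validation_split=0.2 it takes the last 20% of the training rows
history = model.fit(train_features, train_labels,
                    epochs=500,
                    validation_split=0.2)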

The dataframe contains grassland vegetation information collected over three regions in Germany. Each region has 50 plots, and each plot is sampled in May of each year from 2017 to 2020.

image

Predictors are the interpolated spectral data from the 10 bands of Sentinel-2 at 16 time stamps: from April to October, every 14 days.

image

Import libraries and data

from __future__ import absolute_import, division, print_function, unicode_literals

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy as sp
import statistics
import math

from keras.utils.vis_utils import plot_model

from google.colab import drive, files

Mount Google Drive to access the dataset

drive.mount('/content/drive')
Mounted at /content/drive

Change working directory to a folder in your drive. We will clone my repository there.

%cd .'/drive/MyDrive/SEBAS/Pyproject_modules/SeBAS_project_github/'
/content/drive/MyDrive/SEBAS/Pyproject_modules/SeBAS_project_github

Clone my repository, initialize it, add the origin, and pull to access all the files.

! git clone https://github.com/Havi-muro/SeBAS_project.git
Cloning into 'SeBAS_project'...
remote: Enumerating objects: 338, done.
remote: Counting objects: 100% (255/255), done.
remote: Compressing objects: 100% (178/178), done.
remote: Total 338 (delta 139), reused 187 (delta 74), pack-reused 83
Receiving objects: 100% (338/338), 109.49 MiB | 12.71 MiB/s, done.
Resolving deltas: 100% (171/171), done.
Checking out files: 100% (88/88), done.
! git init
Initialized empty Git repository in /content/drive/MyDrive/SEBAS/Pyproject_modules/SeBAS_project_github/.git/
! git remote add origin https://github.com/Havi-muro/SeBAS_project.git
! git pull
remote: Enumerating objects: 338, done.
remote: Counting objects: 100% (255/255), done.
remote: Compressing objects: 100% (178/178), done.
remote: Total 338 (delta 139), reused 187 (delta 74), pack-reused 83
Receiving objects: 100% (338/338), 109.49 MiB | 12.54 MiB/s, done.
Resolving deltas: 100% (171/171), done.
From https://github.com/Havi-muro/SeBAS_project
 * [new branch]      main       -> origin/main
There is no tracking information for the current branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.

    git pull <remote> <branch>

If you wish to set tracking information for this branch you can do so with:

    git branch --set-upstream-to=origin/<branch> master

Change the working directory again to the repository folder

%cd SeBAS_project
/content/drive/MyDrive/SEBAS/Pyproject_modules/SeBAS_project_github/SeBAS_project
Mydataset_0 = pd.read_csv('data/Bexis_Sentinels_full_dataset.csv') 

Explore and prepare the data

Visualize response variables and predictors.

print(list(Mydataset_0.columns))
['Year', 'ep', 'yep', 'explo', 'x', 'y', 'number_vascular_plants', 'SpecRichness', 'height_cm', 'biomass_g', 'NMDS1', 'NMDS2', 'SpecRich_157', 'Shannon_157', 'Simpson_157', 'inverse_Simpson_157', 'PielouEvenness_157', 'Rao_Q_157', 'Redundancy_157', 'Shannon', 'Simpson', 'FisherAlpha', 'PielouEvenness', 'LUI_2015_2018', 'SoilTypeFusion', 'slope', 'aspect', 'LAI', 'blue', 'green', 'red', 'nir', 'nirb', 're1', 're2', 're3', 'swir1', 'swir2', 'EVI', 'SAVI', 'GNDVI', 'ARVI', 'CHLRE', 'MCARI', 'NDII', 'MIRNIR', 'MNDVI', 'NDVI', 'VHMean_May', 'VVMean_May', 'VVVH', 'TWI', 'blue_0', 'green_0', 'red_0', 'nir_0', 'nirb_0', 're1_0', 're2_0', 're3_0', 'swir1_0', 'swir2_0', 'blue_1', 'green_1', 'red_1', 'nir_1', 'nirb_1', 're1_1', 're2_1', 're3_1', 'swir1_1', 'swir2_1', 'blue_2', 'green_2', 'red_2', 'nir_2', 'nirb_2', 're1_2', 're2_2', 're3_2', 'swir1_2', 'swir2_2', 'blue_3', 'green_3', 'red_3', 'nir_3', 'nirb_3', 're1_3', 're2_3', 're3_3', 'swir1_3', 'swir2_3', 'blue_4', 'green_4', 'red_4', 'nir_4', 'nirb_4', 're1_4', 're2_4', 're3_4', 'swir1_4', 'swir2_4', 'blue_5', 'green_5', 'red_5', 'nir_5', 'nirb_5', 're1_5', 're2_5', 're3_5', 'swir1_5', 'swir2_5', 'blue_6', 'green_6', 'red_6', 'nir_6', 'nirb_6', 're1_6', 're2_6', 're3_6', 'swir1_6', 'swir2_6', 'blue_7', 'green_7', 'red_7', 'nir_7', 'nirb_7', 're1_7', 're2_7', 're3_7', 'swir1_7', 'swir2_7', 'blue_8', 'green_8', 'red_8', 'nir_8', 'nirb_8', 're1_8', 're2_8', 're3_8', 'swir1_8', 'swir2_8', 'blue_9', 'green_9', 'red_9', 'nir_9', 'nirb_9', 're1_9', 're2_9', 're3_9', 'swir1_9', 'swir2_9', 'blue_10', 'green_10', 'red_10', 'nir_10', 'nirb_10', 're1_10', 're2_10', 're3_10', 'swir1_10', 'swir2_10', 'blue_11', 'green_11', 'red_11', 'nir_11', 'nirb_11', 're1_11', 're2_11', 're3_11', 'swir1_11', 'swir2_11', 'blue_12', 'green_12', 'red_12', 'nir_12', 'nirb_12', 're1_12', 're2_12', 're3_12', 'swir1_12', 'swir2_12', 'blue_13', 'green_13', 'red_13', 'nir_13', 'nirb_13', 're1_13', 're2_13', 're3_13', 'swir1_13', 'swir2_13', 'blue_14', 'green_14', 'red_14', 'nir_14', 'nirb_14', 're1_14', 're2_14', 're3_14', 'swir1_14', 'swir2_14', 'blue_15', 'green_15', 'red_15', 'nir_15', 'nirb_15', 're1_15', 're2_15', 're3_15', 'swir1_15', 'swir2_15', 'S2Q', 'S2QF']
Mydataset_0.head(5)
Year ep yep explo x y number_vascular_plants SpecRichness height_cm biomass_g ... red_15 nir_15 nirb_15 re1_15 re2_15 re3_15 swir1_15 swir2_15 S2Q S2QF
0 2017 AEG01 012017ALB ALB 525313 5360594 14.0 33.0 13.38 153.25 ... 418.0 4564.0 4418.0 1368.0 3778.0 4277.0 2138.0 955.0 0.680651 0.656078
1 2017 AEG02 022017ALB ALB 535010 5358415 12.0 21.0 18.75 164.75 ... 404.0 3988.0 3931.0 1152.0 3123.0 3695.0 1968.0 915.0 0.976994 0.359795
2 2017 AEG03 032017ALB ALB 539398 5370311 18.0 35.0 4.53 66.60 ... 506.0 3242.0 3053.0 1157.0 2526.0 2891.0 2050.0 1009.0 0.677909 0.485461
3 2017 AEG04 042017ALB ALB 531016 5359610 14.0 22.0 19.75 189.50 ... 443.0 4101.0 3999.0 1239.0 3231.0 3760.0 2238.0 1077.0 0.700765 0.802787
4 2017 AEG05 052017ALB ALB 532511 5359895 17.0 20.0 15.25 195.25 ... 324.0 4997.0 4810.0 1166.0 3875.0 4636.0 2086.0 908.0 0.724976 0.789717

5 rows × 214 columns

Plot a couple of variables against each other

sns.pairplot(Mydataset_0[['nir_4','biomass_g']], diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x7f55d4a58fd0>

image

Select the study variable ('biomass_g' or 'SpecRichness') and run the preprocessing module. We can change the predictors used within this module. It returns two dataframes: one to analyze, and another one with relevant environmental information to color the predictions vs labels plots later on.

studyvar = 'biomass_g'
import be_preprocessing

Create the datasets needed in the global environment. MydatasetLUI contains information on land use intensity, in case we want to plot predicted vs in situ values color-coded by land use.

Mydataset, MydatasetLUI = be_preprocessing.be_preproc(studyvar)
print(Mydataset.head())
print(list(Mydataset.columns))
      ep  biomass_g   blue  green    red     nir    nirb     re1     re2  \
0  AEG01     153.25  345.0  708.0  434.0  4232.0  4098.0  1159.0  3351.0   
1  HEG01      93.60  344.0  709.0  410.0  4444.0  4301.0  1096.0  3369.0   
2  SEG01     173.30  274.0  609.0  319.0  4079.0  3960.0  1010.0  3110.0   
3  AEG01     197.50  360.0  703.0  471.0  4577.0  4351.0  1072.0  3441.0   
4  HEG01     128.90  376.0  758.0  474.0  4068.0  3868.0  1163.0  3089.0   

      re3   swir1   swir2  
0  3967.0  1976.0   950.0  
1  4125.0  1900.0   901.0  
2  3731.0  1753.0   798.0  
3  4168.0  2082.0   946.0  
4  3685.0  2086.0  1018.0  
['ep', 'biomass_g', 'blue', 'green', 'red', 'nir', 'nirb', 're1', 're2', 're3', 'swir1', 'swir2']

## Run the DNN model

The data contain replicates (4 observations per plot, one per year over 4 years), so we have to ensure that all replicates of a plot always end up in the same fold. For that, gkfold_DNN.py groups the observations by plot. The gkfold_DNN.py module uses modelDNN.py, which contains the architecture of the DNN, and plot_loss.py, which displays the loss function across epochs. We build a df with the predictions for the 5 folds and link them to the original dataset.
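The grouping itself happens inside gkfold_DNN.py; a minimal sketch of the idea with scikit-learn's GroupKFold, assuming the plot ID column 'ep' of Mydataset is used as the group key:

from sklearn.model_selection import GroupKFold

groups = Mydataset['ep']   # plot ID: all 4 yearly replicates of a plot share it
features = Mydataset.drop(['ep', studyvar], axis=1)
labels = Mydataset[studyvar]

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(features, labels, groups), 1):
    # every replicate of a plot falls entirely in the train or the test fold
    print(f'Fold#{fold}: {len(train_idx)} train rows, {len(test_idx)} test rows')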

import gkfold_DNN

EPOCHS = 500

gkfold_DNN.gkfold_DNN(EPOCHS, studyvar)

# build a df of the accumulated predictions vs labels
pred_trues = gkfold_DNN.pred_trues
pred_truesdf = pd.concat(pred_trues)

# build also a df with the test features in the order in which they were split
testfeatures_order = gkfold_DNN.testfeatures_order2
testfeatures_orderdf = pd.concat(testfeatures_order)

# concatenate both laterally. They should correspond.
rsdf = pd.concat([testfeatures_orderdf, pred_truesdf], axis=1)

# merge it with the original dataset using the features as joining field
# This allows us to display preds vs labels along with other variables
Mypredictions = pd.merge(rsdf, MydatasetLUI)
Fold#1
Epoch 126: early stopping
Fold#2
Epoch 168: early stopping
Fold#3
Epoch 106: early stopping
Fold#4
Epoch 167: early stopping
Fold#5
Epoch 170: early stopping

image

The plot above shows the accumulated loss and validation_loss functions for the five folds.
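plot_loss.py is not reproduced in this notebook; a sketch of the kind of figure it presumably draws from a Keras History object:

def plot_loss_sketch(history, epochs, studyvar):
    # history.history stores the per-epoch 'loss' and 'val_loss' values
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlim(0, epochs)
    plt.xlabel('Epoch')
    plt.ylabel(f'Error [{studyvar}]')
    plt.legend()
    plt.show()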

## Plot results

Plot predictions versus labels using a density plot

#Make a density plot
from scipy.stats import gaussian_kde

y = Mypredictions['preds']#.astype(float)
x = Mypredictions['labels']

# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=100)

#plt.xlabel('Biomass $(g/m^{2})$')
#plt.ylabel('Predicted biomass $(g/m^{2})$')
plt.ylabel(f'Predicted {studyvar}')
plt.xlabel(f'In situ {studyvar}')
plt.xlim(0, max(Mydataset[studyvar]))
plt.ylim(0, max(Mydataset[studyvar]))
#add a r=1 line
line = np.array([0,max(Mydataset[studyvar])])
plt.plot(line,line,lw=1, c="black")
plt.show()

image

#fig.savefig('sppRich_2022.png')
#files.download('sppRich_2022.png')

Plot predictions vs labels colored by some field such as LUI

Mypredictions = Mypredictions.rename(columns = {'LUI_2015_2018':'LUI'})

# Plot predictions color coded with LUI
myplot = sns.scatterplot(data=Mypredictions,
                         y='preds',
                         x=studyvar,
                         hue = 'LUI', 
                         palette='viridis',
                         #cmap = 'Reds',
                         linewidth=0,
                         s = 20
                         )
plt.ylabel(f'{studyvar} Predicted')
plt.xlabel(f'{studyvar} insitu values')
myplot.legend(title="LUI")
plt.xlim(0, max(Mydataset[studyvar]))
plt.ylim(0, max(Mydataset[studyvar]))

#add a r=1 line
line = np.array([0,max(Mydataset[studyvar])])
plt.plot(line,line,lw=1, c="black")
plt.show()
#fig.savefig(f'{studyvar}allfolds_plot2.svg')

image

Another way to plot results using the compiled df of predictions vs labels

"""
% Matlab function to calculate model evaluation statistics 
% S. Robeson, November 1993
% Author of the Matlab function: Zbynek Malenovsky

% zb(1):  mean of observed variable 
% zb(2):  mean of predicted variable 
% zb(3):  std dev of observed variable 
% zb(4):  std dev of predicted variable 
% zb(5):  correlation coefficient
% zb(6):  intercept of OLS regression
% zb(7):  slope of OLS regression
% zb(8):  mean absolute error (MAE)
% zb(9):  index of agreement (based on MAE)
% zb(10): root mean squared error (RMSE)
% zb(11): relative root mean squared error (RRMSE)
% zb(12): RMSE, systematic component
% zb(13): RMSE, unsystematic component
% zb(14): index of agreement (based on RMSE)

""" 
# I have to import in each iteration
import gkfold_DNN 
# import kfold_RF
met_ls=[]
for i in range(3):
    
    EPOCHS = 500
    
    # We build a df of the accumulated predictions vs labels
    # for DNN or RF

    gkfold_DNN.gkfold_DNN(EPOCHS, studyvar)
    pred_trues = gkfold_DNN.pred_trues

    #kfold_RF.kfold_RF(studyvar)
    #pred_trues = kfold_RF.pred_trues
    
    pred_truesdf = pd.concat(pred_trues).reset_index(drop=True)
    
    # Select the last batch of predictions.
    # Predictions accumulate across iterations,
    # even if we delete all the variables
    pred_truesdf = pred_truesdf.tail(Mydataset.shape[0])
    
    # Predictions vs labels
    y = pred_truesdf['preds']
    x = pred_truesdf['labels']
    
    # Calculate the point density
    xy = np.vstack([x,y])
    z = gaussian_kde(xy)(xy)
    
    fig, ax = plt.subplots()
    ax.scatter(x, y, c=z, s=100)
    
    plt.ylabel(f'Predicted {studyvar}')
    plt.xlabel(f'In situ {studyvar}')
    plt.xlim(0, max(Mydataset[studyvar]))
    plt.ylim(0, max(Mydataset[studyvar]))
    # add a r=1 line
    line = np.array([0,max(Mydataset[studyvar])])
    plt.plot(line,line,lw=1, c="black")
    plt.show()
    
    # Calculate additional metrics
    n = len(x)
    so = x.sum()
    sp = y.sum()
    
    sumo2 = (x**2).sum()
    sump2 = (y**2).sum()
    
    sum2 = ((x-y)**2).sum()
    sumabs = abs(x-y).sum()
    
    sumdif = (x-y).sum()
    cross = (x*y).sum()
    
    obar = x.mean()
    pbar = y.mean()
    
    # Std devs of observed and predicted, correlation, and OLS fit
    sdo = math.sqrt(sumo2/n - obar*obar)
    sdp = math.sqrt(sump2/n - pbar*pbar)
    c = cross/n - obar*pbar        # covariance
    r = c/(sdo*sdp)                # correlation coefficient
    r2 = r**2
    b = r*sdp/sdo                  # OLS slope
    a = pbar - b*obar              # OLS intercept
    mse = sum2/n
    mae = sumabs/n
    
    # Decompose the MSE into systematic and unsystematic components
    msea = a**2
    msei = 2*a*(b-1)*obar
    msep = ((b-1)**2)*sumo2/n
    mses = msea + msei + msep
    mseu = mse - mses
    
    rmse = math.sqrt(mse)
    rrmse = rmse/obar              # relative RMSE
    rmses = math.sqrt(mses)        # systematic RMSE
    rmseu = math.sqrt(mseu)        # unsystematic RMSE
    
    # Indices of agreement, based on MAE (d1) and on MSE (d2)
    pe1 = (abs(y-obar) + abs(x-obar)).sum()
    pe2 = ((abs(y-obar) + abs(x-obar))**2).sum()
    d1 = 1 - n*mae/pe1
    d2 = 1 - n*mse/pe2
    
    zb = [obar,pbar,sdo,sdp,r,a,b,mae,d1,rmse,rrmse,rmses,rmseu,d2]
    
    results = [r2, rrmse, rmses, rmseu]
    
    met_ls.append(results)
# Check that length = number of loops    
len(met_ls)
3

Calculate the mean and standard deviation of the metrics over the iterations performed

r2_hat = statistics.mean([x[0] for x in met_ls])
r2_sd = statistics.pstdev([x[0] for x in met_ls])

rrmse_hat = statistics.mean([x[1] for x in met_ls])
rrmse_sd = statistics.pstdev([x[1] for x in met_ls])

rmses_hat = statistics.mean([x[2] for x in met_ls])
rmses_sd = statistics.pstdev([x[2] for x in met_ls])

rmseu_hat = statistics.mean([x[3] for x in met_ls])
rmseu_sd = statistics.pstdev([x[3] for x in met_ls])

Print the accuracy metric statistics of the iterations

print(f'r2_hat: {r2_hat:.2f}')
print(f'r2_sd: {r2_sd:.2f}')

print(f'rrmse_hat: {rrmse_hat:.2f}')
print(f'rrmse_sd: {rrmse_sd:.2f}')

print(f'rmses_hat: {rmses_hat:.2f}')
print(f'rmses_sd: {rmses_sd:.2f}')

print(f'rmseu_hat: {rmseu_hat:.2f}')
print(f'rmseu_sd: {rmseu_sd:.2f}')
r2_hat: 0.43
r2_sd: 0.01
rrmse_hat: 0.48
rrmse_sd: 0.00
rmses_hat: 56.34
rmses_sd: 0.74
rmseu_hat: 50.68
rmseu_sd: 0.34

# Random forest option

Rinse and repeat for random forest. Reset the runtime first, because values from previous runs can remain stored and do not disappear even after deleting all variables.
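gkfold_RF.py mirrors the structure of gkfold_DNN.py; a sketch of the regressor presumably at its core, assuming the same GroupKFold train/test splits (hyperparameters here are illustrative, not the module's actual settings):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, random_state=42)  # illustrative settings
rf.fit(train_features, train_labels)
preds = rf.predict(test_features)
importances = rf.feature_importances_  # collected per fold into importance_list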

# 2.- Group kfold approach with RF
import gkfold_RF
gkfold_RF.gkfold_RF(studyvar)

# build a df of the accumulated predictions vs labels
pred_trues = gkfold_RF.pred_trues
pred_truesdf = pd.concat(pred_trues)

# build also a df with the test features in the order in which they were split
testfeatures_order = gkfold_RF.testfeatures_order2
testfeatures_orderdf = pd.concat(testfeatures_order)

# concatenate both laterally. They should correspond.
rsdf = pd.concat([testfeatures_orderdf, pred_truesdf], axis=1)

# merge it with the original dataset using the features as joining field
# This allows us to display preds vs labels along with other variables
Mypredictions = pd.merge(rsdf, MydatasetLUI)

importance_list = gkfold_RF.importance_list
Fold#1
Fold#2
Fold#3
Fold#4
Fold#5
#Make a density plot
y = Mypredictions['preds']#.astype(float)
x = Mypredictions['labels']

# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=100)

plt.ylabel(f'Predicted {studyvar}')
plt.xlabel(f'In situ {studyvar}')
plt.xlim(0, max(Mydataset[studyvar]))
plt.ylim(0, max(Mydataset[studyvar]))
#add a r=1 line
line = np.array([0,max(Mydataset[studyvar])])
plt.plot(line,line,lw=1, c="black")
plt.show()

image

Spatial cross-validation

An alternative to the group k-fold validation is spatial cross-validation, where two regions (explo) are used for training and the third one for validation. It is set to run 5 times to account for random effects.

import spcv_DNN

# Choose which site is used for test and which one(s) for training
EPOCHS = 200
Mydataset['explo'] = MydatasetLUI['explo']
Mydataset=Mydataset.drop('ep', axis=1)
train_dataset = Mydataset[(Mydataset['explo']=='SCH')       
                          | (Mydataset['explo'] == 'HAI')   # take this line out to use only 1 site for training
                          ].drop(['explo'], axis=1)
                            
test_dataset = Mydataset[Mydataset['explo']=='ALB'].drop(['explo'], axis=1)

spcv_DNN.spcv_DNN(EPOCHS, train_dataset, test_dataset, studyvar)

# Put results as variables in global environment

RMSE_test_list = spcv_DNN.RMSE_test_list
RRMSE_test_list = spcv_DNN.RRMSE_test_list
rsq_list = spcv_DNN.rsq_list

predictions_list = spcv_DNN.predictions_list

image

Print the accuracy metrics of each iteration

print(rsq_list)
print(RMSE_test_list)
print(RRMSE_test_list)
[0.4407638533454107, 0.45785435115421924, 0.45102480701665704, 0.4414485989861104, 0.4436519028718409]
[62.379330983620584, 61.24431598508869, 61.959981292801345, 62.75660529768834, 62.66722208521514]
[0.4006665564177742, 0.39337628023536836, 0.39797304570026554, 0.40308981421581147, 0.40251569994785624]

Create model with all observations

Once we know how well our model performs after multiple cross-validations, we obtain the best results by calibrating our model with all observations (I'm aware that this is frowned upon by many, but to produce the best predictions it is always better to have the full range of observations).

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from keras.callbacks import EarlyStopping
import modelDNN
import plot_loss

import tensorflow as tf
#from tensorflow import keras
from tensorflow.keras.layers.experimental import preprocessing
Mydataset = Mydataset.drop('ep', axis=1)
print(Mydataset.head())
   biomass_g   blue  green    red     nir    nirb     re1     re2     re3  \
0     153.25  345.0  708.0  434.0  4232.0  4098.0  1159.0  3351.0  3967.0   
1      93.60  344.0  709.0  410.0  4444.0  4301.0  1096.0  3369.0  4125.0   
2     173.30  274.0  609.0  319.0  4079.0  3960.0  1010.0  3110.0  3731.0   
3     197.50  360.0  703.0  471.0  4577.0  4351.0  1072.0  3441.0  4168.0   
4     128.90  376.0  758.0  474.0  4068.0  3868.0  1163.0  3089.0  3685.0   

    swir1   swir2  
0  1976.0   950.0  
1  1900.0   901.0  
2  1753.0   798.0  
3  2082.0   946.0  
4  2086.0  1018.0  
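modelDNN.py itself is not shown here; a minimal sketch of what its build_model function could look like, assuming a small feed-forward regressor compiled with Adam (as the saved model name below suggests) and the RootMeanSquaredError metric that the val_root_mean_squared_error column relies on (layer sizes and loss are illustrative):

def build_model_sketch(norm, train_features):
    # train_features is unused in this sketch; the real module may use it to set the input shape
    model = tf.keras.Sequential([
        norm,                                    # the adapted normalization layer
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)                 # single regression output
    ])
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.Adam(0.001),
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model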
# All observations are training
train_dataset = Mydataset
    
# Copy features
train_features = train_dataset.copy()

# Separate labels and features
train_labels = train_features.pop(studyvar)

#Create normalizer layer and adapt it to our data
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))

################################################################################
model = modelDNN.build_model(normalizer, train_features)
EPOCHS = 500
################################################################################
   
#Add an early stopping to avoid overfitting
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50)

# Train the model
# The Keras fit function expects the training features as an array.
history = model.fit(
    train_features,
    train_labels,
    epochs=EPOCHS,
    validation_split=0.2,
    verbose=0,
    callbacks=[es]
)

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

plot_loss.plot_loss(history, EPOCHS, studyvar)

# Measure RMSE using the validation (20%) data
# With early stopping the run rarely reaches EPOCHS - 1, so take the last epoch
RMSE_val = hist['val_root_mean_squared_error'].iloc[-1]
Epoch 130: early stopping

image

Save the model so that we can apply it to a raster in the next section

tf.keras.models.save_model(model, filepath = 'spatial/'+f'{studyvar}_adam_model_S2bands_02Sep2022_3')

Apply model to raster

Once the model has been calibrated, it can be applied to a stack of raster images.

First we have to install the libraries to handle rasters

%%capture
!pip install earthpy
!pip install pyrsgis
#conda install -c pratyusht pyrsgis
#conda install --channel "pratyusht" package
!pip install rioxarray
import earthpy.spatial as es
from pyrsgis import raster
from pyrsgis.convert import changeDimension
import numpy as np
from tensorflow import keras
import earthpy.plot as ep
#import geopandas as gpd
import rioxarray as rxr
from shapely.geometry import mapping
from glob import glob
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
%cd .'/drive/MyDrive/SEBAS/Pyproject_modules/SeBAS_project/'
/content/drive/MyDrive/SEBAS/Pyproject_modules/SeBAS_project

Read the model

model = keras.models.load_model(f'spatial/{studyvar}_adam_model_S2bands_02Sep2022/')

List all the rasters and stack them in the same order as the predictors used to calibrate the model.

file_list = glob("data/rs/*.tif")
file_list
['data/rs/alb_20200524_b01.tif',
 'data/rs/alb_20200524_b02.tif',
 'data/rs/alb_20200524_b03.tif',
 'data/rs/alb_20200524_b04.tif',
 'data/rs/alb_20200524_b05.tif',
 'data/rs/alb_20200524_b06.tif',
 'data/rs/alb_20200524_b07.tif',
 'data/rs/alb_20200524_b08.tif',
 'data/rs/alb_20200524_b09.tif',
 'data/rs/alb_20200524_b10.tif']
arr_st, meta = es.stack(file_list)

To display it, we might have to mask the max and min values.

myrast_masked = np.ma.masked_values(arr_st, np.amin(arr_st))
myrast_masked = np.ma.masked_values(myrast_masked, np.max(myrast_masked))
fig, ax = plt.subplots(figsize=(12, 12))

ep.plot_rgb(myrast_masked, rgb=(3, 2, 1), ax=ax, title="Sentinel-2 NIR-G-R")
plt.show()

image

ep.hist(myrast_masked)
plt.show()

image

Change the array to a table (the function changeDimension() will be deprecated in future versions in favor of array_to_table()).

myrast_reshape = changeDimension(arr_st)
The "changeDimension()" function has moved to "array_to_table()" and will be deprecated in future versions. Please check the pyrsgis documentation at https://pyrsgis.readthedocs.io/en/master/ for more details.

Apply model

myrast_pred = model.predict(myrast_reshape)

Reshape the prediction back to the original raster dimensions (x_size, y_size).

arr_st.shape
(10, 2243, 3194)
prediction = np.reshape(myrast_pred, (arr_st.shape[1], arr_st.shape[2]))

Predictions for areas that are not grasslands (e.g. dry or flooded land) return absurdly high values. Set a maximum (e.g. the maximum of the training data) and turn everything above it into NaN.

prediction[prediction>900] = np.nan
plt.imshow(prediction.squeeze())
<matplotlib.image.AxesImage at 0x7f058b193110>

image

It is still necessary to clip the prediction to grassland areas.
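A minimal sketch of that step with the rioxarray and shapely imports loaded above, assuming the prediction is first written to GeoTIFF with the stack's metadata; data/grasslands.shp is a hypothetical vector file of grassland polygons:

import rasterio
import geopandas as gpd

# Write the prediction with the georeferencing of the input stack
out_meta = meta.copy()
out_meta.update(count=1, dtype='float32')
with rasterio.open('spatial/prediction.tif', 'w', **out_meta) as dst:
    dst.write(prediction.astype('float32'), 1)

# Clip to the grassland polygons and display
grass = gpd.read_file('data/grasslands.shp')       # hypothetical path
pred = rxr.open_rasterio('spatial/prediction.tif')
pred_clipped = pred.rio.clip(grass.geometry.apply(mapping), grass.crs)
pred_clipped.squeeze().plot()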

import nbconvert
