usgs-r / drb-gw-hw-model-prep Goto Github PK

Code repo to prepare groundwater and headwater-related datasets for modeling river temperature in the Delaware River Basin

License: Other

R 80.03% Python 19.97%

drb-gw-hw-model-prep's People

Contributors

Watchers

Forkers

lekoenig msleckman janetrbarclay

drb-gw-hw-model-prep's Issues

Figure out how to pull `drb_climate_2022_06_14.nc` in pipeline with S3

File is 15GB.

Need to pull drb_climate_2022_06_14.nc more efficiently. Let's explore placing it in an S3 bucket that we can request directly in 1_fetch.R

drb-gw-hw-model-prep/1_fetch.R

Lines 177 to 186 in f4ab37a

 # Read in meteorological data aggregated to NHDPlusV2 catchments for the  

 # DRB (prepped in https://github.com/USGS-R/drb_gridmet_tools). Note that 

 # the DRB met data file must be stored in 1_fetch/in. If working outside 

 # of tallgrass/caldera, this file will need to be downloaded from the 

 # PGDL-DO project's S3 bucket and manually placed in 1_fetch/in. 

 tar_target( 

 p1_drb_nhd_gridmet, 

 "1_fetch/in/drb_climate_2022_06_14.nc", 

 format = "file" 

 ),

Omit unused MODFLOW values from `p1_modflow_discharge`

From @janetrbarclay:

We don't use q_std, q_std_per and nDown. They could be removed from the feather file

This came up in response to a question from Margaux when creating the data dictionary.

Resolve remaining issues running adjusted NHM widths

On second look, there appear to be some errors with running the adjusted NHM widths from #30 through river-dl.

From @janetrbarclay:

We need to change "segidnat" to "seg_id_nat" in the column names
Also, seg_id_nat should be numeric (either int or float is fine)
3 segments (1437, 1442, and 1485) have duplicate entries in the attribute feather file with different widths
1 segment (1721) is missing an entry
[there 1 extra segment (3558) but this isn't a problem]

COMIDs in nhd inputs zarr don't match COMIDs in distance matrix

When testing the zarr data store in #19, Janet noticed that there's a mismatch in the COMIDs present in the distance matrix (from drb-network-prep) and those present in p2_input_drivers_nhd, the data frame that gets written to zarr and contains the input drivers for our NHD-downscaling version of the model.

There are 13 reaches in the zarr file that are not in the distance matrix:

[1748579, 8076904, 8077128, 8077146, 8077148, 8077154, 8077164, 8077166, 9484328, 9484338, 9484406, 9484460, 9486550]

And there are 51 reaches in the distance matrix that are not in the driver data:

[1748581, 1748739, 1750633, 2585051, 2587011, 2588371, 2614050, 2615272, 2615284, 2743164, 2743508, 2743520, 2743522, 4148042, 4150536, 4151818, 4153726, 4186295, 4186425, 4188023, 4188041, 4188275, 4489216, 4490988, 4492442, 4492448, 4495872, 4495884, 4495910, 4496232, 4496410, 4496424, 4496436, 4496460, 4648744, 4648766, 4652082, 4655446, 4782271, 4782507, 4782567, 4783437, 4783475, 8076898, 8077118, 8077572, 8077738, 9481952, 9481980, 9482048, 9485884]

This happens because we're currently grabbing the COMIDs from a version of the NHM-NHD crosswalk table that includes divergent reaches as well as "zero-area" reaches that don't have associated catchments. But divergences complicate the calculation of network distances, so the adjacency matrix is calculated assuming a dendritic network with no divergences.

The vast majority of the 51 COMIDs listed above are "zero-area" reaches - those reaches are in the distance matrix because they're needed for network navigation, but they don't have climate data because they don't have catchments. Janet suggests that this case - COMIDs in the distance matrix but not in the zarr data - is less of an issue because river-dl just ignores those. However, it causes problems when there are COMIDs in the zarr file that aren't included in the distance matrix (e.g. the 13 COMIDs above).

10 out of those 13 reaches are divergent reaches and the remaining three ("9486550", "9484328", and "8077148") are COMIDs that get dropped from the segment when other divergent reaches along the NHM segment are omitted. In summary, I think for the set of NHD downscaling experiments we should predict on the dendritic network so that the COMIDs in the nhd inputs zarr data match those river-dl is expecting based on the distance matrix.

Adjust pipeline phase names to reflect different sets of experiments?

Currently our pipeline uses the ~default phase structure of 1_fetch and 2_process. However, it might help us keep the targets straight if we renamed the process phase to align with our two sets of model experiments, i.e. 2_process becomes 2_nhd_downscaling and 3_nhm_groundwater.

Outline structure for groundwater data release

Planning steps for the groundwater data release that will accompany the paper that Janet Barclay is leading:

create an excel file that contains datasets that will be included
sketch out structure of data release; do we anticipate needing any child items?

Note that this data release will not include any targets/files created as part of the FY22 NHDv2 downscaling experiments.

Describe pipeline structure in README

Let's aim to update the readme with a quick descriptor that describes the 3 pipeline scripts

Originally posted by @msleckman in #57 (comment)

Compile river-dl input drivers at NHDv2 resolution

Input drivers for river-dl include slope, elevation, and width. We currently have these values for segments in the NHM fabric and need to compile these values for NHDv2 reaches in the DRB.

Check on preferred column names for temperature observations

The 2022 forecasting data release contains both aggregated (by NHM segment-date) and unaggregated temperature observations, both of which have columns for min_temp_c, mean_temp_c, and max_temp_c. As a result, these are the columns we carry through when we create p2_drb_temp_obs_by_comid, which is a data frame containing one row of temperature data per COMID-date.

Janet pointed out that river-dl is expecting a column called temp_c and so to run river-dl she changes the column from mean_temp_c to temp_c. This issue is a reminder to check in with others who run river-dl to see if switching to mean_temp_c in that workflow would cause any issues. If so, we should adjust the column names used in our pipeline to match what river-dl is expecting.

All geospatial switch to projected crs 5070

K-sat attr not found in StatsGo SB release with other soil attr

Missing k-sat attr - refering to soil hydraulic conductivity - from the SB Statsgo data release.

Likely need another resource for k-sat, while other soil characteristics (percent, k-fact, water table depth) are easily accessible in sb data release above.

Combine met driver data with NHDv2-scale attributes

Combine meteorological driver data aggregated at the NHDv2-scale with the segment/catchment attributes in p2_input_drivers_nhd.

Fetch Wieczorek et al. NHDPlusv2 attributes

Download the following datasets from the Wieczorek et al. data releases on ScienceBase:

Other datasets that may be useful but are not higher priority:

bedrock permeability classes (data release)
county-level freshwater withdrawals (data release)
percent land in irrigated agriculture (data release)
NLCD impervious cover (data release)
Hydrologic group A soils (data release)
soil permeability and depth to seasonally high water table (data release)
soil percent sand (data release) Check whether these data have already been downloaded as part of the STATSGO variables

Create feather file that includes gw-relevant attributes and modflow values

River-dl is expecting a feather file that contains all the catchment attributes. We will need to combine the groundwater-relevant features with the MODFLOW outputs that Janet has compiled.

Check/QC NHD-scale gridmet data

We've noticed some weird values in the gridmet sw-rad data when comparing our meteorological drivers between the NHM and NHDPlusv2 scales, notably that there's some values that are close to/equal to zero at the NHD-scale but with larger values at the NHM scale (and vice versa). Why is that? It might be useful to start with the NHD-scale data and investigate where the NHD values equal ~zero.

Systematic differences in width between NHD and NHM

We currently see systematic differences in segment width between the NHM and NHDPlusv2 values (in meters), respectively. In the plot below, seg_width_max is the maximum "empirically-estimated" width for the COMIDs that comprise each NHM segment.

We should inspect these variables to understand why the width estimates differ, especially for wider reaches (e.g. along the Delaware River mainstem). One idea for creating a more "apples-to-apples" comparison between the NHM and NHD widths is to just use our empirical approach and replace seg_width (NHM) with seg_width_max and use those widths as a new catchment attribute in river-dl.

Standardize NHM segment identifiers across catchment attribute targets

The temperature data release includes two segment identifier columns, subsegid and seg_id_nat. There are 459 unique values of subsegid in the DRB and 456 unique values of seg_id_nat. (The difference arises because segidnat's 1437, 1442, 1485 were split during processing for the temperature project. This step is in the delaware-model-prep repo.)

When processing catchment attributes, we're sometimes using one identifier column and sometimes using the other (examples included below). We should decide how many segments we're expecting in the network, and whether we should use subsegid (sometimes referred to in our pipeline as PRMS_segid because of naming conventions used in the inland salinity project) or seg_id_nat to represent unique segments for modeling.

# 1) Here's an example where we use seg_id_nat and therefore end up with 456 segments:
> tar_load(p2_confinement_mcmanamay_filled)
> dim(p2_confinement_mcmanamay_filled)
[1] 456   7
> head(p2_confinement_mcmanamay_filled, 3)
# A tibble: 3 x 7
  seg_id_nat reach_length_km lengthkm_mcmanamay_is_na prop_reach_w_mcmanamay confinement_calc_mcmanamay flag_mcmanamay flag_gaps
  <chr>                <dbl>                    <dbl>                  <dbl>                      <dbl> <chr>          <chr>    
1 1435                  13.6                    0                      1                          12.8  NA             NA       
2 1436                  19.1                    0.518                  0.973                      13.9  NA             NA       
3 1437                  19.6                    0                      1                           8.98 NA             NA  

#2) Here's an example where we use subsegid/PRMS_segid and therefore end up with 459 segments:
> tar_load(p2_soller_coarse_sediment_reaches_nhm)
> dim(p2_soller_coarse_sediment_reaches_nhm)
[1] 459   5
> head(sf::st_drop_geometry(p2_soller_coarse_sediment_reaches_nhm), 3)
# A tibble: 3 x 4
  PRMS_segid total_reach_buffer_area_km2 cs_area_km2 cs_area_proportion
  <chr>                           [km^2]      [km^2]              <dbl>
1 1_1                               6.94           0                  0
2 10_1                              1.24           0                  0
3 11_1                              1.10           0                  0
>

Investigate NA in raster extraction for depth to bedrock

Following GW-HW check-in discussion (07/19), noting 2 observations related to NAs in the processing step for depth to bedrock raster:

1. Complete rasters not always entirely encapsulated within polygon boundary of given reach catchment/buffer.

Example with 100 m buffer:

Further exploration shows that this improves with a larger buffer (500m) - note NA is black in this image

To do: Determine the threshold for counting raster within AOI defined by polygon boundary, if it exists.
Decision: For our pipeline, using 250m buffer seems most appropriate given the res of spatial dataset

2. Some NA's appearing through summarization of raster per polygon.

When na.rm == FALSE, in red are the catchments (top) & buffered_500m reahces (bottom) with NA summarization result for raster value aggregation (width of buffer doesn't change the identified catchments/reaches). So far, they do not appear linked to actually NA raster values.

When na.rm == TRUE, these pixels get a summarization output value other than NA.
- See below buffered reaches (500m) when na.rm== T NA polygons in red are expected here as there is no data on delta.

To do: Determine why certain catchments get aggregated to NA
Decision: Issue more/less resolves itself when na.rm = TRUE, so setting that in pipeline but documenting here for further discussion.

For further exploration, reproducible gist here

Add all catchments per reach target to pipeline

Produce a sf dataframe that links each reach to all catchments that drain to that reach

Inputs:

NHD catchments: either form NHD catchment gkpg from drb-network-prep or pulling directly with get_nhdplys(comid = p1_nhd_reaches$comid, realization = 'catchment')
p1_drb_comids_all_tribs

Process: join two inputs using featureid and comid linkage and disolve such that each reach is tied to single catchment polygon

Use for the depth to bedrock processing and any other raster processing that requires raster value summarization within such catchment area.

Compile air temp data for NHD and NHM scales?

For the downscaling experiments we need to compile meteorological drivers, including daily air temperature, at the NHD-resolution using gridmet data. For the NHM scale, air temperature (seg_tave_air) is the daily average air temperature from PRMS-SNTemp, but gridmet only contains daily min/max values. Using either the daily min or max results in a systemic bias in the raw air temperature values between the NHD and NHM scales, as shown in the plots below.

In #19, we used the average of the daily min and max temperatures from gridmet to make the values more comparable between NHD and NHM. However, given the importance of air temperature for water temperature predictions, we may want to take a further step of compiling temp data in the same way for the NHM-scale (i.e., use the data aggregated from gridmet rather than using the SNTemp values). If we do that, we'll need to rename the variables so that the names are shared between the NHD and NHM scales but differ from the naming used in PRMS/SNTemp (e.g. use seg_airtemp_gridmet).

tidy netcdf with xarray

@lekoenig the subset_nc_to_comid.py processes slowly for me in our targets pipeline. This may tied to reticulate, but one idea I have is to simplify the ds_to_dataframe_faster() and replace it with xarray.to_dataframe(). Have you tried this function? It's more of a "blackbox" xarray tidying function that does what you have built with ds_to_dataframe_faster() but it processes the netcdf quite a bit faster (see time tests below - i tested across different time frames to see how it scales).

If the results below work for you, I can go ahead and make and test this code modification in my branch addressing #37.

# function from subset_nc_to_comid.py 
def ds_to_dataframe_faster(ds):
    """
    doing this to try to avoid the multi-index joins
    """
    series_list = []
    for var in ds.data_vars:
        if "lat" not in var and "lon" not in var:
            df_var = ds[var].to_pandas().reset_index()
            df_var_tidy = df_var.melt(id_vars='COMID', value_name=var)
            series_list.append(df_var_tidy[var])
    series_list.append(df_var_tidy['COMID'])
    series_list.append(df_var_tidy['time'])
    return pd.concat(series_list, axis=1)

# timing function
def time_xr_tidying(ds):
    start = process_time()
    ds_fun = ds_to_dataframe_faster(ds)
    end = process_time()
    print('elapsed time w/function:', end - start)

    start = process_time()
    ds_df = ds.to_dataframe().reset_index()
    end = process_time()
    print('elapsed time w/ .to_dataframe():', end - start)


# time testing
nc_file  = 'drb_climate_2022_06_14.nc'
ds = xr.open_dataset(nc_file, decode_times=True)

# subset - 1 yr
ds_1yr = ds.sel(time = slice('1979-01-01','1980-01-01'))
# subset - 10 yr
ds_10yr = ds.sel(time = slice('1979-01-01','1989-01-01'))
# subset - 20 yr
ds_20yr = ds.sel(time = slice('1979-01-01','1999-01-01'))

> time_xr_tidying(ds_1yr)
elapsed time w/function: 8.313794999999999
elapsed time w/ .to_dataframe(): 0.9057740000000081

> time_xr_tidying(ds_10yr)
elapsed time w/function: 112.05785399999999
elapsed time w/ .to_dataframe(): 33.176807

> time_xr_tidying(ds_20yr)
elapsed time w/function: 233.733969
elapsed time w/ .to_dataframe(): 86.80633

output summary

The output are identical (the to_dataframe() process keeps hru_lat and hur_lon columns so shapes differ in the number of cols only (ds_df.shape = (5904312, 12) vs. ds_fun.shape = (5904312, 10))

Shown below are summary stats + example plot for 1 var. All other variables had same output on terms of both datasets being identical.

> grouped_ds_df = ds_df.groupby('time').mean()
> grouped_ds_fun = ds_fun.groupby('time').mean()
> print(grouped_ds_fun.drop('COMID', axis = 1).describe())
> print(grouped_ds_df.drop(['COMID','hru_lat','hru_lon'], axis = 1).describe())

             tmmx        tmmn          pr  ...        rmax        rmin         sph
count  366.000000  366.000000  366.000000  ...  366.000000  366.000000  366.000000
mean    60.254989   40.821050    0.150591  ...   85.017617   45.119070    0.006339
std     18.341400   17.193906    0.277693  ...   12.213456    8.979670    0.003999
min     10.863673  -13.021575    0.000000  ...   43.492542   15.085517    0.000502
25%     46.622133   28.374595    0.001501  ...   77.554284   39.408278    0.002895
50%     62.686404   41.535074    0.030448  ...   87.578144   45.064299    0.005557
75%     75.340224   54.036601    0.156937  ...   95.402246   52.055688    0.009510
max     88.576795   70.733331    1.751408  ...  100.000000   67.168022    0.015907
[8 rows x 8 columns]
             tmmx        tmmn          pr  ...        rmax        rmin         sph
count  366.000000  366.000000  366.000000  ...  366.000000  366.000000  366.000000
mean    60.254989   40.821050    0.150591  ...   85.017617   45.119070    0.006339
std     18.341400   17.193906    0.277693  ...   12.213456    8.979670    0.003999
min     10.863673  -13.021575    0.000000  ...   43.492542   15.085517    0.000502
25%     46.622133   28.374595    0.001501  ...   77.554284   39.408278    0.002895
50%     62.686404   41.535074    0.030448  ...   87.578144   45.064299    0.005557
75%     75.340224   54.036601    0.156937  ...   95.402246   52.055688    0.009510
max     88.576795   70.733331    1.751408  ...  100.000000   67.168022    0.015907
[8 rows x 8 columns]

Standardize function documentation

Fix all functions so that documention is found above first line of the function.

Originally posted by @msleckman in #43 (comment)

Python file not invalidating in targets

I've noticed an issue where changes to a python file (2_proces/src/subset_nc_to_comid) don't seem to invalidate the corresponding target, causing downstream targets to not rebuild when they should. One fix may be to route the python file path through a table that calculates a hash for that file (ht Anthony and Julie).

reach without catchment

Reach COMID: 4188275 (len = 48 m) does not have an associated catchment from nhdplustools

Zoom out:

Should we remove this reach? For task in #18 assuming we are going the route of capturing depth to bedrock within the reach's buffer area, the buffering of this reach works and ensures we have a encircling polygon tied to this reach. But down the line, we may run into issues with this reach.

Compile channel confinement data

Options include the FACET dataset (Hopkins et al. 2020) which is DRB-specific, and the McManamay and DeRolph national dataset which is indexed by COMID.

Use date range to restrict NWIS widths download

The function summarize_nwis_widths() has optional arguments to set the date range for pulling the width data. However, we don't use those arguments in estimate_mean_width(). The implication is that different users will get different estimates, depending on the date they built the pipeline and pulled the widths down from NWIS.

For reproducibility, we should consider using date ranges in estimate_mean_width.

Make sure NHD-scale and NHM-scale gridmet drivers use the same units

There appear to be some discrepancies between the gridmet driver data compiled at the NHDv2 and NHM-scales. The NHM data in the graphs below is coming from uncal_sntemp_input_output.zip here. For the NHM-scale data, seg_rain is in units of meters and seg_tave_air is in units of degrees Celsius. Some of this discrepancy may be due to a difference in units since gridmet data is in units of degrees Farenheit and precip is in units of inches.

Add data related to segment canopy cover/shade?

From Janet:

[Shade could be a] factor that might mediate the influence of atmospheric conditions vs groundwater discharge on temperature. Anyways, if we have reliable estimates that are easy to incorporate, it might be worth adding that to our list of data to pull.

Some ideas:

For the NHM fabric, PRMS input attributes (seg_shade, covden_sum, covden_win)?
For either the NHM/NHDv2 fabric, NLCD-based estimates of percent forest cover in some riparian buffer zone?

Match temperature sites to NHDv2 flowlines

We'll need a target that matches the DRB temperature site locations to NHDv2 flowline reaches.

nhd observation file columns names

In the zarr file of nhd resolution observations drb_temp_observations_nhdv2.zarr the time column should be renamed to date (the temporal coordinate is already date)

Map temp observations onto NHDPlusv2 COMIDs

The observation sites are snapped to NHDPlusv2 flowlines in p2_drb_temp_sites_w_segs. It would be helpful to also have a target that outputs a data frame with the observational time series but with COMID as an additional column. To match the time series with the COMID we'd need to join p1_drb_temp_obs and p2_drb_temp_sites_w_segs by the site_id.

From Janet:

[we can] export either a csv or a netcdf with the appropriate columns named the same as the aggregated temps in the new data release (except with the addition of of COMID), I could use the snakemake file to reshape it. (or I could get you the specs of what river-dl wants and you could do that in the pipeline and export the zarr)

Omit `seg_id_nat` 3558 from output table

In the DRB, seg_id_nat == 3558 is dropped from the river-dl pipeline due to really weird PRMS and SNTemp numbers (see river-dl issue here). To avoid confusion regarding the expected number of river segments, I suggest we drop this segment from the output table in p2b_static_inputs_nhm_combined.

Output `p2_input_drivers_nhd` as xarray

I didn't think of this until now. The way river-dl is written, it expects an xarray for the main input drivers (and a feather file for the additional catchment attributes). The python data-prep script I have uses .to_zarr() to export the files. I dug around a little and didn't find any super obvious options for exporting the prepped data from R as the xarray file. It would be ideal to have the reformatting within this pipeline, but maybe we need to export it from R as a feather (or netCDF?) file and then reformat it using python? Thoughts?

Originally posted by @janetrbarclay in #12 (comment)

add seg_tave_water to PRMS outputs at NHD resolution

To be able to pretrain at the NHD resolution, we need to have seg_tave_water added to the sntemp outputs that are compiled at the NHD resolution (what becomes nhdv2_inputs_io.zarr)

Fetch temperature data from ScienceBase

The code in this pipeline depends on output data from an "in process" data release on ScienceBase. Add functionality to download those data directly from ScienceBase. See Lindsay's workflow here.

trouble downloading study_monitoring_sites.zip with sbtools

Issue downloading study_monitoring_sites.zip from https://www.sciencebase.gov/catalog/item/623e54c4d34e915b67d83580

sbtools::item_file_download(sb_id = "623e54c4d34e915b67d83580", names = "study_monitoring_sites.zip", destinations = "1_fetch/in", overwrite_file = TRUE)
Error in sbtools::item_file_download(sb_id = "623e54c4d34e915b67d83580",  : 
  623e54c4d34e915b67d83580Item does not contain all requested files

Likely due to the display of the study_monitoring_sites files being unzipped in the release. Note that no errors occur when using sciencebasepy

Need to find work around for this pipeline to successfully run 1_fetch.R#L41-L49 and build target p1_drb_temp_sites_shp

Process Wieczorek attributes to NHM-scale

We'll need to process the Wieczorek datasets downloaded in #39 to the NHM-scale. We will adopt code from USGS-R/drb-inland-salinity-ml to do this.

This issue involves the following sub-tasks:

For each NHM segment, gather the NHDv2 attributes for the total upstream watershed. This is relatively simple because we can use the comid_down column from the NHM-NHDv2 crosswalk table to pull the attribute values for only those COMIDs that represent the downstream terminus of each NHM segment.
For each attribute/variable in 1_fetch/in/nhdv2_attributes_from_sciencebase.csv, we need to specify a new column called CAT_aggregation_operation which tells us how we should process that attribute when we're moving from the NHDPlusv2-scale for which these attributes were computed to the NHM-scale which (often) encompasses multiple NHDPlusv2 catchments.
For each NHM segment, use the catchment aggregation specified in the step above to process the NHDv2-scale attributes to the NHM-scale.
Combine upstream- and catchment-scale attributes into a single data frame with one row per NHM seg id.

	# Read in meteorological data aggregated to NHDPlusV2 catchments for the
	# DRB (prepped in https://github.com/USGS-R/drb_gridmet_tools). Note that
	# the DRB met data file must be stored in 1_fetch/in. If working outside
	# of tallgrass/caldera, this file will need to be downloaded from the
	# PGDL-DO project's S3 bucket and manually placed in 1_fetch/in.
	tar_target(
	p1_drb_nhd_gridmet,
	"1_fetch/in/drb_climate_2022_06_14.nc",
	format = "file"
	),