usgs-r / drb-gw-hw-model-prep Goto Github PK
View Code? Open in Web Editor NEWCode repo to prepare groundwater and headwater-related datasets for modeling river temperature in the Delaware River Basin
License: Other
Code repo to prepare groundwater and headwater-related datasets for modeling river temperature in the Delaware River Basin
License: Other
File is 15GB.
Need to pull drb_climate_2022_06_14.nc
more efficiently. Let's explore placing it in an S3 bucket that we can request directly in 1_fetch.R
drb-gw-hw-model-prep/1_fetch.R
Lines 177 to 186 in f4ab37a
From @janetrbarclay:
We don't use q_std, q_std_per and nDown. They could be removed from the feather file
This came up in response to a question from Margaux when creating the data dictionary.
On second look, there appear to be some errors with running the adjusted NHM widths from #30 through river-dl.
From @janetrbarclay:
When testing the zarr data store in #19, Janet noticed that there's a mismatch in the COMIDs present in the distance matrix (from drb-network-prep) and those present in p2_input_drivers_nhd
, the data frame that gets written to zarr and contains the input drivers for our NHD-downscaling version of the model.
There are 13 reaches in the zarr file that are not in the distance matrix:
[1748579, 8076904, 8077128, 8077146, 8077148, 8077154, 8077164, 8077166, 9484328, 9484338, 9484406, 9484460, 9486550]
And there are 51 reaches in the distance matrix that are not in the driver data:
[1748581, 1748739, 1750633, 2585051, 2587011, 2588371, 2614050, 2615272, 2615284, 2743164, 2743508, 2743520, 2743522, 4148042, 4150536, 4151818, 4153726, 4186295, 4186425, 4188023, 4188041, 4188275, 4489216, 4490988, 4492442, 4492448, 4495872, 4495884, 4495910, 4496232, 4496410, 4496424, 4496436, 4496460, 4648744, 4648766, 4652082, 4655446, 4782271, 4782507, 4782567, 4783437, 4783475, 8076898, 8077118, 8077572, 8077738, 9481952, 9481980, 9482048, 9485884]
This happens because we're currently grabbing the COMIDs from a version of the NHM-NHD crosswalk table that includes divergent reaches as well as "zero-area" reaches that don't have associated catchments. But divergences complicate the calculation of network distances, so the adjacency matrix is calculated assuming a dendritic network with no divergences.
The vast majority of the 51 COMIDs listed above are "zero-area" reaches - those reaches are in the distance matrix because they're needed for network navigation, but they don't have climate data because they don't have catchments. Janet suggests that this case - COMIDs in the distance matrix but not in the zarr data - is less of an issue because river-dl just ignores those. However, it causes problems when there are COMIDs in the zarr file that aren't included in the distance matrix (e.g. the 13 COMIDs above).
10 out of those 13 reaches are divergent reaches and the remaining three ("9486550", "9484328", and "8077148") are COMIDs that get dropped from the segment when other divergent reaches along the NHM segment are omitted. In summary, I think for the set of NHD downscaling experiments we should predict on the dendritic network so that the COMIDs in the nhd inputs zarr data match those river-dl is expecting based on the distance matrix.
Currently our pipeline uses the ~default phase structure of 1_fetch
and 2_process
. However, it might help us keep the targets straight if we renamed the process phase to align with our two sets of model experiments, i.e. 2_process
becomes 2_nhd_downscaling
and 3_nhm_groundwater
.
Planning steps for the groundwater data release that will accompany the paper that Janet Barclay is leading:
Note that this data release will not include any targets/files created as part of the FY22 NHDv2 downscaling experiments.
Let's aim to update the readme with a quick descriptor that describes the 3 pipeline scripts
Originally posted by @msleckman in #57 (comment)
Input drivers for river-dl include slope, elevation, and width. We currently have these values for segments in the NHM fabric and need to compile these values for NHDv2 reaches in the DRB.
The 2022 forecasting data release contains both aggregated (by NHM segment-date) and unaggregated temperature observations, both of which have columns for min_temp_c
, mean_temp_c
, and max_temp_c
. As a result, these are the columns we carry through when we create p2_drb_temp_obs_by_comid
, which is a data frame containing one row of temperature data per COMID-date.
Janet pointed out that river-dl is expecting a column called temp_c
and so to run river-dl she changes the column from mean_temp_c
to temp_c
. This issue is a reminder to check in with others who run river-dl to see if switching to mean_temp_c
in that workflow would cause any issues. If so, we should adjust the column names used in our pipeline to match what river-dl is expecting.
Missing k-sat attr - refering to soil hydraulic conductivity - from the SB Statsgo data release.
Likely need another resource for k-sat, while other soil characteristics (percent, k-fact, water table depth) are easily accessible in sb data release above.
Combine meteorological driver data aggregated at the NHDv2-scale with the segment/catchment attributes in p2_input_drivers_nhd
.
Download the following datasets from the Wieczorek et al. data releases on ScienceBase:
OLSON_PERM
(data release)Other datasets that may be useful but are not higher priority:
River-dl is expecting a feather file that contains all the catchment attributes. We will need to combine the groundwater-relevant features with the MODFLOW outputs that Janet has compiled.
We've noticed some weird values in the gridmet sw-rad data when comparing our meteorological drivers between the NHM and NHDPlusv2 scales, notably that there's some values that are close to/equal to zero at the NHD-scale but with larger values at the NHM scale (and vice versa). Why is that? It might be useful to start with the NHD-scale data and investigate where the NHD values equal ~zero.
We currently see systematic differences in segment width between the NHM and NHDPlusv2 values (in meters), respectively. In the plot below, seg_width_max
is the maximum "empirically-estimated" width for the COMIDs that comprise each NHM segment.
We should inspect these variables to understand why the width estimates differ, especially for wider reaches (e.g. along the Delaware River mainstem). One idea for creating a more "apples-to-apples" comparison between the NHM and NHD widths is to just use our empirical approach and replace seg_width
(NHM) with seg_width_max
and use those widths as a new catchment attribute in river-dl.
The temperature data release includes two segment identifier columns, subsegid
and seg_id_nat
. There are 459 unique values of subsegid
in the DRB and 456 unique values of seg_id_nat
. (The difference arises because segidnat's 1437, 1442, 1485 were split during processing for the temperature project. This step is in the delaware-model-prep repo.)
When processing catchment attributes, we're sometimes using one identifier column and sometimes using the other (examples included below). We should decide how many segments we're expecting in the network, and whether we should use subsegid
(sometimes referred to in our pipeline as PRMS_segid
because of naming conventions used in the inland salinity project) or seg_id_nat
to represent unique segments for modeling.
# 1) Here's an example where we use seg_id_nat and therefore end up with 456 segments:
> tar_load(p2_confinement_mcmanamay_filled)
> dim(p2_confinement_mcmanamay_filled)
[1] 456 7
> head(p2_confinement_mcmanamay_filled, 3)
# A tibble: 3 x 7
seg_id_nat reach_length_km lengthkm_mcmanamay_is_na prop_reach_w_mcmanamay confinement_calc_mcmanamay flag_mcmanamay flag_gaps
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1435 13.6 0 1 12.8 NA NA
2 1436 19.1 0.518 0.973 13.9 NA NA
3 1437 19.6 0 1 8.98 NA NA
#2) Here's an example where we use subsegid/PRMS_segid and therefore end up with 459 segments:
> tar_load(p2_soller_coarse_sediment_reaches_nhm)
> dim(p2_soller_coarse_sediment_reaches_nhm)
[1] 459 5
> head(sf::st_drop_geometry(p2_soller_coarse_sediment_reaches_nhm), 3)
# A tibble: 3 x 4
PRMS_segid total_reach_buffer_area_km2 cs_area_km2 cs_area_proportion
<chr> [km^2] [km^2] <dbl>
1 1_1 6.94 0 0
2 10_1 1.24 0 0
3 11_1 1.10 0 0
>
Following GW-HW check-in discussion (07/19), noting 2 observations related to NAs in the processing step for depth to bedrock raster:
Further exploration shows that this improves with a larger buffer (500m) - note NA is black in this image
To do: Determine the threshold for counting raster within AOI defined by polygon boundary, if it exists.
Decision: For our pipeline, using 250m buffer seems most appropriate given the res of spatial dataset
na.rm == FALSE
, in red are the catchments (top) & buffered_500m reahces (bottom) with NA summarization result for raster value aggregation (width of buffer doesn't change the identified catchments/reaches). So far, they do not appear linked to actually NA raster values.na.rm == TRUE
, these pixels get a summarization output value other than NA.
na.rm== T
NA polygons in red are expected here as there is no data on delta.To do: Determine why certain catchments get aggregated to NA
Decision: Issue more/less resolves itself when na.rm = TRUE
, so setting that in pipeline but documenting here for further discussion.
For further exploration, reproducible gist here
Produce a sf dataframe that links each reach to all catchments that drain to that reach
Inputs:
get_nhdplys(comid = p1_nhd_reaches$comid, realization = 'catchment')
p1_drb_comids_all_tribs
Process: join two inputs using featureid and comid linkage and disolve such that each reach is tied to single catchment polygon
Use for the depth to bedrock processing and any other raster processing that requires raster value summarization within such catchment area.
For the downscaling experiments we need to compile meteorological drivers, including daily air temperature, at the NHD-resolution using gridmet data. For the NHM scale, air temperature (seg_tave_air
) is the daily average air temperature from PRMS-SNTemp, but gridmet only contains daily min/max values. Using either the daily min or max results in a systemic bias in the raw air temperature values between the NHD and NHM scales, as shown in the plots below.
In #19, we used the average of the daily min and max temperatures from gridmet to make the values more comparable between NHD and NHM. However, given the importance of air temperature for water temperature predictions, we may want to take a further step of compiling temp data in the same way for the NHM-scale (i.e., use the data aggregated from gridmet rather than using the SNTemp values). If we do that, we'll need to rename the variables so that the names are shared between the NHD and NHM scales but differ from the naming used in PRMS/SNTemp (e.g. use seg_airtemp_gridmet
).
@lekoenig the subset_nc_to_comid.py
processes slowly for me in our targets pipeline. This may tied to reticulate, but one idea I have is to simplify the ds_to_dataframe_faster()
and replace it with xarray.to_dataframe(). Have you tried this function? It's more of a "blackbox" xarray tidying function that does what you have built with ds_to_dataframe_faster()
but it processes the netcdf quite a bit faster (see time tests below - i tested across different time frames to see how it scales).
If the results below work for you, I can go ahead and make and test this code modification in my branch addressing #37.
# function from subset_nc_to_comid.py
def ds_to_dataframe_faster(ds):
"""
doing this to try to avoid the multi-index joins
"""
series_list = []
for var in ds.data_vars:
if "lat" not in var and "lon" not in var:
df_var = ds[var].to_pandas().reset_index()
df_var_tidy = df_var.melt(id_vars='COMID', value_name=var)
series_list.append(df_var_tidy[var])
series_list.append(df_var_tidy['COMID'])
series_list.append(df_var_tidy['time'])
return pd.concat(series_list, axis=1)
# timing function
def time_xr_tidying(ds):
start = process_time()
ds_fun = ds_to_dataframe_faster(ds)
end = process_time()
print('elapsed time w/function:', end - start)
start = process_time()
ds_df = ds.to_dataframe().reset_index()
end = process_time()
print('elapsed time w/ .to_dataframe():', end - start)
# time testing
nc_file = 'drb_climate_2022_06_14.nc'
ds = xr.open_dataset(nc_file, decode_times=True)
# subset - 1 yr
ds_1yr = ds.sel(time = slice('1979-01-01','1980-01-01'))
# subset - 10 yr
ds_10yr = ds.sel(time = slice('1979-01-01','1989-01-01'))
# subset - 20 yr
ds_20yr = ds.sel(time = slice('1979-01-01','1999-01-01'))
> time_xr_tidying(ds_1yr)
elapsed time w/function: 8.313794999999999
elapsed time w/ .to_dataframe(): 0.9057740000000081
> time_xr_tidying(ds_10yr)
elapsed time w/function: 112.05785399999999
elapsed time w/ .to_dataframe(): 33.176807
> time_xr_tidying(ds_20yr)
elapsed time w/function: 233.733969
elapsed time w/ .to_dataframe(): 86.80633
The output are identical (the to_dataframe() process keeps hru_lat
and hur_lon
columns so shapes differ in the number of cols only (ds_df.shape = (5904312, 12)
vs. ds_fun.shape = (5904312, 10)
)
Shown below are summary stats + example plot for 1 var. All other variables had same output on terms of both datasets being identical.
> grouped_ds_df = ds_df.groupby('time').mean()
> grouped_ds_fun = ds_fun.groupby('time').mean()
> print(grouped_ds_fun.drop('COMID', axis = 1).describe())
> print(grouped_ds_df.drop(['COMID','hru_lat','hru_lon'], axis = 1).describe())
tmmx tmmn pr ... rmax rmin sph
count 366.000000 366.000000 366.000000 ... 366.000000 366.000000 366.000000
mean 60.254989 40.821050 0.150591 ... 85.017617 45.119070 0.006339
std 18.341400 17.193906 0.277693 ... 12.213456 8.979670 0.003999
min 10.863673 -13.021575 0.000000 ... 43.492542 15.085517 0.000502
25% 46.622133 28.374595 0.001501 ... 77.554284 39.408278 0.002895
50% 62.686404 41.535074 0.030448 ... 87.578144 45.064299 0.005557
75% 75.340224 54.036601 0.156937 ... 95.402246 52.055688 0.009510
max 88.576795 70.733331 1.751408 ... 100.000000 67.168022 0.015907
[8 rows x 8 columns]
tmmx tmmn pr ... rmax rmin sph
count 366.000000 366.000000 366.000000 ... 366.000000 366.000000 366.000000
mean 60.254989 40.821050 0.150591 ... 85.017617 45.119070 0.006339
std 18.341400 17.193906 0.277693 ... 12.213456 8.979670 0.003999
min 10.863673 -13.021575 0.000000 ... 43.492542 15.085517 0.000502
25% 46.622133 28.374595 0.001501 ... 77.554284 39.408278 0.002895
50% 62.686404 41.535074 0.030448 ... 87.578144 45.064299 0.005557
75% 75.340224 54.036601 0.156937 ... 95.402246 52.055688 0.009510
max 88.576795 70.733331 1.751408 ... 100.000000 67.168022 0.015907
[8 rows x 8 columns]
Fix all functions so that documention is found above first line of the function.
Originally posted by @msleckman in #43 (comment)
I've noticed an issue where changes to a python file (2_proces/src/subset_nc_to_comid
) don't seem to invalidate the corresponding target, causing downstream targets to not rebuild when they should. One fix may be to route the python file path through a table that calculates a hash for that file (ht Anthony and Julie).
Reach COMID: 4188275
(len = 48 m) does not have an associated catchment from nhdplustools
Zoom out:
Should we remove this reach? For task in #18 assuming we are going the route of capturing depth to bedrock within the reach's buffer area, the buffering of this reach works and ensures we have a encircling polygon tied to this reach. But down the line, we may run into issues with this reach.
Options include the FACET dataset (Hopkins et al. 2020) which is DRB-specific, and the McManamay and DeRolph national dataset which is indexed by COMID.
The function summarize_nwis_widths()
has optional arguments to set the date range for pulling the width data. However, we don't use those arguments in estimate_mean_width()
. The implication is that different users will get different estimates, depending on the date they built the pipeline and pulled the widths down from NWIS.
For reproducibility, we should consider using date ranges in estimate_mean_width
.
There appear to be some discrepancies between the gridmet driver data compiled at the NHDv2 and NHM-scales. The NHM data in the graphs below is coming from uncal_sntemp_input_output.zip here. For the NHM-scale data, seg_rain
is in units of meters and seg_tave_air
is in units of degrees Celsius. Some of this discrepancy may be due to a difference in units since gridmet data is in units of degrees Farenheit and precip is in units of inches.
From Janet:
[Shade could be a] factor that might mediate the influence of atmospheric conditions vs groundwater discharge on temperature. Anyways, if we have reliable estimates that are easy to incorporate, it might be worth adding that to our list of data to pull.
Some ideas:
seg_shade
, covden_sum
, covden_win
)?We'll need a target that matches the DRB temperature site locations to NHDv2 flowline reaches.
In the zarr file of nhd resolution observations drb_temp_observations_nhdv2.zarr
the time
column should be renamed to date
(the temporal coordinate is already date
)
The observation sites are snapped to NHDPlusv2 flowlines in p2_drb_temp_sites_w_segs
. It would be helpful to also have a target that outputs a data frame with the observational time series but with COMID as an additional column. To match the time series with the COMID we'd need to join p1_drb_temp_obs
and p2_drb_temp_sites_w_segs
by the site_id
.
From Janet:
[we can] export either a csv or a netcdf with the appropriate columns named the same as the aggregated temps in the new data release (except with the addition of of COMID), I could use the snakemake file to reshape it. (or I could get you the specs of what river-dl wants and you could do that in the pipeline and export the zarr)
In the DRB, seg_id_nat == 3558
is dropped from the river-dl pipeline due to really weird PRMS and SNTemp numbers (see river-dl issue here). To avoid confusion regarding the expected number of river segments, I suggest we drop this segment from the output table in p2b_static_inputs_nhm_combined
.
I didn't think of this until now. The way river-dl is written, it expects an xarray for the main input drivers (and a feather file for the additional catchment attributes). The python data-prep script I have uses .to_zarr() to export the files. I dug around a little and didn't find any super obvious options for exporting the prepped data from R as the xarray file. It would be ideal to have the reformatting within this pipeline, but maybe we need to export it from R as a feather (or netCDF?) file and then reformat it using python? Thoughts?
Originally posted by @janetrbarclay in #12 (comment)
To be able to pretrain at the NHD resolution, we need to have seg_tave_water added to the sntemp outputs that are compiled at the NHD resolution (what becomes nhdv2_inputs_io.zarr
)
The code in this pipeline depends on output data from an "in process" data release on ScienceBase. Add functionality to download those data directly from ScienceBase. See Lindsay's workflow here.
Issue downloading study_monitoring_sites.zip from https://www.sciencebase.gov/catalog/item/623e54c4d34e915b67d83580
sbtools::item_file_download(sb_id = "623e54c4d34e915b67d83580", names = "study_monitoring_sites.zip", destinations = "1_fetch/in", overwrite_file = TRUE)
Error in sbtools::item_file_download(sb_id = "623e54c4d34e915b67d83580", :
623e54c4d34e915b67d83580Item does not contain all requested files
Likely due to the display of the study_monitoring_sites files being unzipped in the release. Note that no errors occur when using sciencebasepy
Need to find work around for this pipeline to successfully run 1_fetch.R#L41-L49 and build target p1_drb_temp_sites_shp
We'll need to process the Wieczorek datasets downloaded in #39 to the NHM-scale. We will adopt code from USGS-R/drb-inland-salinity-ml to do this.
This issue involves the following sub-tasks:
comid_down
column from the NHM-NHDv2 crosswalk table to pull the attribute values for only those COMIDs that represent the downstream terminus of each NHM segment.1_fetch/in/nhdv2_attributes_from_sciencebase.csv
, we need to specify a new column called CAT_aggregation_operation
which tells us how we should process that attribute when we're moving from the NHDPlusv2-scale for which these attributes were computed to the NHM-scale which (often) encompasses multiple NHDPlusv2 catchments.A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.