ideas-lab-nus / epwshiftr
Create future EnergyPlus Weather files using CMIP6 data
Home Page: https://ideas-lab-nus.github.io/epwshiftr/
License: Other
The code coverage dropped to ~63%. This is because the morphing-related tests were not run, as the .fst file produced by the extract_data() test results was not ready.
The ESGF RESTful API can generate wget scripts to download files. See: Download data from ESGF using wget.
When executing a search that targets specific data nodes, the shards parameter should be used instead of data_node. Ref: Shard Queries
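A shard-targeted query differs from a data_node filter only in how the request URL is composed. Below is a minimal sketch of building such a URL; the endpoint, shard names, and the build_shard_query() helper are illustrative assumptions, not part of epwshiftr:

```r
# Sketch: compose an ESGF search URL that targets specific index shards
# instead of filtering results by `data_node`. All names below are
# hypothetical; only the `shards=` query-string mechanism is the point.
build_shard_query <- function(base, shards, params) {
    query <- c(
        paste0("shards=", paste(shards, collapse = ",")),
        paste0(names(params), "=",
               vapply(params, utils::URLencode, "", reserved = TRUE))
    )
    paste0(base, "?", paste(query, collapse = "&"))
}

url <- build_shard_query(
    "https://esgf-node.llnl.gov/esg-search/search",
    shards = c("esgf-data1.llnl.gov:80/solr", "esgf.ceda.ac.uk:80/solr"),
    params = list(project = "CMIP6", variable_id = "tas")
)
```

The resulting URL carries a single shards= entry listing every Solr shard to fan the query out to.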
I found that the description of the data.table output in the help differs from what I get when I run the two functions below. Please update.
extract_data()
morphing_epw()
Hi
The init_cmip6_index(..., source=NULL)
usage seems broken:
packageDescription("epwshiftr")[c("Version", "Built")]
# 0.1.4
# R 4.1.2; ; 2024-04-12 10:18:46 UTC; unix
dt <- epwshiftr::init_cmip6_index(activity="CMIP", experiment="historical", variable="tas", frequency="mon") # ok
dt <- epwshiftr::init_cmip6_index(activity="CMIP", experiment="historical", variable="tas", frequency="mon", source=NULL) # error
Error in esgf_query(activity = unique(q$activity_drs), variable = unique(q$variable_id), :
Assertion on 'activity' failed: Must be a subset of {'AerChemMIP','C4MIP','CDRMIP','CFMIP','CMIP','CORDEX','DAMIP','DCPP','DynVarMIP','FAFMIP','GMMIP','GeoMIP','HighResMIP','ISMIP6','LS3MIP','LUMIP','OMIP','PAMIP','PMIP','RFMIP','SIMIP','ScenarioMIP','VIACSAB','VolMIP'}, but has additional elements {'E3SM-Project','CAS'}.
Cheers,
Chris
Currently, most of the functions are tested locally, as they all need NetCDF files, which can be quite large. We could download a small number of NetCDF files and use the GitHub Actions cache to reuse them between workflow runs.
Error using summary_database()
Hi, I am trying to use the tool and receiving a warning message when using summary_database(). I have copied the files into the same directory as cmip6_index.csv. However, I am getting the following warning message.
In addition: Warning message:
Case(s) shown below does not matche any NetCDF file in the database. Please make sure all needed NetCDF files listed in the file index have been downloaded and placed in the database.
#1 | For case 'CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp245.r1i1p1f1.day.tas.gn.v20190529.tas_day_AWI-CM-1-1-MR_ssp245_r1i1p1f1_gn_20340101-20341231.nc':
#2 | For case 'CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp245.r1i1p1f1.day.tas.gn.v20190529.tas_day_AWI-CM-1-1-MR_ssp245_r1i1p1f1_gn_20350101-20351231.nc':
#3 | For case 'CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp245.r1i1p1f1.day.tas.gn.v20190529.tas_day_AWI-CM-1-1-MR_ssp245_r1i1p1f1_gn_20360101-20361231.nc':
This leads to the following error in the subsequent steps:
coord$coord[, .(file_path, coord)]
Empty data.table (0 rows and 2 cols): file_path,coord
> str(coord$coord$coord[[1]])
Error in coord$coord$coord[[1]] : subscript out of bounds
The name of a sample .nc file :
tas_day_AWI-CM-1-1-MR_ssp126_r1i1p1f1_gn_20350101-20351231
library(epwshiftr)
esgf_query(variable = "tas", resolution = NULL)
#> No matched data. Please check network connection and the availability of LLNL ESGF node.
#> Null data.table (0 rows and 0 cols)
Created on 2021-02-01 by the reprex package (v0.3.0)
esgf_query() can only send a query with very strict constraints. Right now, it is impossible to build temporal-coverage queries. It would be better to build a step-by-step query builder like httr2.
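A step-by-step builder in the spirit of httr2 could look roughly like the sketch below. esgf_request() and esgf_filter() are hypothetical names invented for illustration, not epwshiftr's API:

```r
# Sketch of a pipeable query builder: each step appends one Solr filter
# query (fq) clause, so constraints can be composed incrementally instead
# of passing every argument to one monolithic function.
esgf_request <- function() list(fq = character())

esgf_filter <- function(req, field, values) {
    # OR together multiple values for the same field, as the ESGF API expects
    clause <- paste0(field, ':"', values, '"', collapse = " || ")
    req$fq <- c(req$fq, clause)
    req
}

req <- esgf_request() |>
    esgf_filter("project", "CMIP6") |>
    esgf_filter("experiment_id", c("ssp126", "ssp245")) |>
    esgf_filter("frequency", "day")
```

A final step (not shown) would serialize req$fq into the request URL and perform the HTTP call.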
Should use the aligned environment instead of align.
! LaTeX Error: Environment align undefined.
See the LaTeX manual or LaTeX Companion for explanation.
Type H <return> for immediate help.
...
l.3884 }{}
Hi! Thanks for this tool, it is quite cool.
I have been testing this tool on some southern-hemisphere climates, and it seems that, by default, the generated weather starts in winter (i.e., as in the Northern Hemisphere, where the 1st of January is winter). I say this because, when plotting the dry-bulb temperatures, I get the following:
Whereas, when I reorganize as follows:
tic = 142 * 24  # June 21
tac = 266 * 24  # Sept 23

# Hour-index ranges for the three seasonal chunks of the 8760-hour year
late_summer = list(range(0, tic))
winter = list(range(tic, tac))
early_summer = list(range(tac, 8760))

nhemisphere_temp = epw.data['dry_bulb_temperature']
late_summer_data = list(nhemisphere_temp[late_summer])
winter_data = list(nhemisphere_temp[winter])
early_summer_data = list(nhemisphere_temp[early_summer])

# Rotate the series so the year starts in (southern-hemisphere) summer
transformed = early_summer_data + late_summer_data + winter_data
epw.data['dry_bulb_temperature'] = transformed
Then I get the following:
Is this a bug? Is there an option for this? Can it be triggered based on the latitude of the EPW?
Best!
match_coord()
Add example code showing how to download an EPW file; maybe just put the same in the Get Started section.
OPeNDAP makes it possible to subset the GCM data directly on the data nodes. This approach can avoid downloading GBs of NetCDF files locally. See DAP2 and DAP4 Protocol Services.
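The server-side subsetting works through a DAP constraint expression appended to the dataset URL. The sketch below only builds that URL; the THREDDS address and dap_subset() helper are assumptions for illustration (actually reading the slice would need something like ncdf4::nc_open() on the constrained URL):

```r
# Sketch: a DAP2 constraint expression like "?tas[t0:t1][y0:y1][x0:x1]"
# asks the server to return only that hyperslab of the variable, so only
# the slice crosses the network instead of the whole NetCDF file.
dap_subset <- function(url, var, time, lat, lon) {
    idx <- function(r) paste0("[", r[1], ":", r[2], "]")
    paste0(url, "?", var, idx(time), idx(lat), idx(lon))
}

dap_subset("https://esgf.example.org/thredds/dodsC/tas_day.nc",
           "tas", time = c(0, 364), lat = c(10, 12), lon = c(40, 42))
```

The index ranges are zero-based array bounds in the DAP convention, not coordinate values.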
Hello! I've been having issues downloading models. For some reason, it gets stuck on most of the files that I've tried to pull. See some examples using this code for the variable o2.
Thanks for any help!
idx <- init_cmip6_index(
# only consider ScenarioMIP activity
activity = "ScenarioMIP",
# specify the variable of interest
variable = "o2",
# specify report frequency
frequency = "mon",
# specify experiment names
experiment = c("ssp126", "ssp245", "ssp585"),
# specify GCM name
source = NULL,
# specify variant
variant = "r1i1p1f1",
# More options
replica = FALSE,
latest = TRUE,
resolution = NULL,
data_node = NULL,
# specify years of interest
years = seq(2022, 2100, 1),
# save to data dictionary
save = TRUE
)
esm <- idx$file_url[1]
download.file(url = esm[1],
destfile = paste0("inputs/o2/", basename(esm[1])),
cacheOK = TRUE,
# `extra` flags are only passed through when method = "wget"
method = "wget",
extra = "--random-wait --retry-on-http-error=503",
mode = "wb")
Currently, future_epw()
directly returns the created Epw
objects for future climate. I always find I have to do manual steps to process each generated EPW file name using regex to get an idea of the scenario of each output. It would be useful to return a data.frame containing the information about how the data is split and aggregated based on the by
argument.
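Recovering that metadata currently means parsing the generated file names by hand. A minimal sketch of what the returned data.frame could contain is below; the "<city>.<source>.<experiment>.<interval>.epw" naming pattern and the parse_epw_names() helper are illustrative assumptions, not epwshiftr's actual output convention:

```r
# Sketch: split generated EPW file names into their metadata fields so
# future_epw() could hand back a data.frame instead of bare Epw objects.
parse_epw_names <- function(paths) {
    parts <- strsplit(sub("\\.epw$", "", basename(paths)), ".", fixed = TRUE)
    data.frame(
        path       = paths,
        source     = vapply(parts, `[`, "", 2),  # GCM name
        experiment = vapply(parts, `[`, "", 3),  # e.g. ssp585
        interval   = vapply(parts, `[`, "", 4)   # e.g. target year
    )
}

parse_epw_names("SGP_Singapore.EC-Earth3.ssp585.2050.epw")
```

Returning such a table alongside (or instead of) the Epw objects would make it trivial to join outputs back to the `by` splits.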
Currently, summary_database()
only lists files with .nc
extensions. There are some GCMs that output files in HDF5 format.
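Widening the listing would mostly be a matter of the file-name pattern. A sketch, where the exact set of HDF5 extensions is an assumption about what GCM output may use:

```r
# Sketch: list database files by a broader extension set than just ".nc".
# The extension list is illustrative, not a definitive inventory.
list_gcm_files <- function(dir) {
    list.files(dir, pattern = "\\.(nc|nc4|h5|hdf5)$",
               ignore.case = TRUE, full.names = TRUE)
}

# demo on a throwaway directory
d <- tempfile(); dir.create(d)
file.create(file.path(d, c("tas_day.nc", "pr_day.h5", "notes.txt")))
basename(list_gcm_files(d))
```

Note that files matched this way would still need to be readable by whatever NetCDF/HDF5 reader the package uses downstream.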
Cannot build an index file for monthly data. There is a conflict, as "mon" becomes "Amon" after the first esgf_query().
library(epwshiftr)
options(epwshiftr.dir = "tmp")
options(epwshiftr.verbose = TRUE)
# get CMIP6 data nodes
nodes <- get_data_node()
idx <- init_cmip6_index(
activity = "ScenarioMIP",
variable = "tas",
frequency = "mon",
source = c("EC-Earth3"),
experiment = c("ssp126"),
data_node = nodes[status == "UP", data_node],
years = c(2050, 2080)
)
Here is the error I receive:
Error in esgf_query(activity = unique(q$activity_drs), variable = unique(q$variable_id), :
Assertion on 'frequency' failed: Must be a subset of {'1hr','1hrCM','1hrPt','3hr','3hrPt','6hr','6hrPt','day','dec','fx','mon','monC','monPt','subhrPt','yr','yrPt'}, but is {'Amon'}.
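One possible fix is to map CMIP6 table_id values such as "Amon" back onto the frequency vocabulary before revalidating. The mapping below is a partial, illustrative sketch, not the complete CMIP6 table list:

```r
# Sketch: normalize a CMIP6 table_id (realm prefix + frequency, e.g. "Amon")
# back to a plain frequency string before it is checked against the allowed
# frequency set. The lookup table here is deliberately incomplete.
table_to_frequency <- function(table_id) {
    map <- c(Amon = "mon", Omon = "mon", SImon = "mon", Lmon = "mon",
             day = "day", mon = "mon")
    out <- unname(map[table_id])
    # pass through anything not in the map unchanged (e.g. "3hr")
    ifelse(is.na(out), table_id, out)
}

table_to_frequency(c("Amon", "day", "3hr"))
```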
init_cmip6_index()
first sends a query for Dataset
and then uses the same input for File
, and finally merges these two results together. It should directly use the dataset_id
from the Dataset
query when fetching output file information.
dataset_id cannot be used as the unique identifier of a dataset, as it is specific to the data node. This did not cause any problems for esgf_query(), but did result in duplicated entries in the results of init_cmip6_index() when replica is set to TRUE. dataset_pid should be used as the unique dataset identifier when building the index.
q <- epwshiftr::esgf_query(
activity = "ScenarioMIP",
variable = "tas",
frequency = "day",
experiment = "ssp585",
source = "AWI-CM-1-1-MR",
variant = "r1i1p1f1",
replica = TRUE,
latest = TRUE,
resolution = "100 km",
limit = 10000L,
data_node = NULL
)
q[, .(dataset_id, dataset_pid)]
#> dataset_id
#> 1: CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp585.r1i1p1f1.day.tas.gn.v20190529|esgf-data1.llnl.gov
#> 2: CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp585.r1i1p1f1.day.tas.gn.v20190529|esgf-data3.diasjp.net
#> 3: CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp585.r1i1p1f1.day.tas.gn.v20190529|esgf.ceda.ac.uk
#> 4: CMIP6.ScenarioMIP.AWI.AWI-CM-1-1-MR.ssp585.r1i1p1f1.day.tas.gn.v20190529|esgf.nci.org.au
#> dataset_pid
#> 1: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
#> 2: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
#> 3: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
#> 4: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
unique(q[, -c("dataset_id", "data_node")])
#> mip_era activity_drs institution_id source_id experiment_id member_id
#> 1: CMIP6 ScenarioMIP AWI AWI-CM-1-1-MR ssp585 r1i1p1f1
#> table_id frequency grid_label version nominal_resolution variable_id
#> 1: day day gn 20190529 100 km tas
#> variable_long_name variable_units
#> 1: Near-Surface Air Temperature K
#> dataset_pid
#> 1: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
Created on 2022-09-19 with reprex v2.0.2
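The deduplication itself reduces to keeping one row per dataset_pid, since every replica shares the same handle. A minimal sketch with base R (epwshiftr itself uses data.table, so the real fix would likely use unique(dt, by = "dataset_pid")):

```r
# Sketch: two replica rows of the same dataset differ only in the data-node
# suffix of dataset_id; the dataset_pid handle is identical, so it serves
# as the unique key. The abbreviated IDs below mirror the reprex output.
datasets <- data.frame(
    dataset_id  = c("CMIP6....v20190529|esgf-data1.llnl.gov",
                    "CMIP6....v20190529|esgf.ceda.ac.uk"),
    dataset_pid = rep("hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8", 2)
)

unique_datasets <- datasets[!duplicated(datasets$dataset_pid), ]
nrow(unique_datasets)  # 1
```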
I'm trying to download daily data from the MIROC6 model. I've checked that the result in question is available in the ESGF system. Looking at the idx response attributes, the shards entry looks odd.
Is there a limit as to which models we can search?
library(epwshiftr)
idx <- init_cmip6_index(
# only consider ScenarioMIP activity
activity = "ScenarioMIP",
# specify variables
variable = c("pr"),
# specify report frequent
frequency = "day",
# specify experiment name
experiment = c("ssp245"),
# specify GCM name
source = c("MIROC6"),
# specify variant
variant = "r1i1p1f1"
)
#> No matched data. Please examine the actual response using 'attr(x, "response")'.
# This query should return one result
# looking at the response attribute
attr(idx, 'response')
#> $responseHeader
#> $responseHeader$status
#> [1] 0
#>
#> $responseHeader$QTime
#> [1] 28
#>
#> $responseHeader$params
#> $responseHeader$params$df
#> [1] "text"
#>
#> $responseHeader$params$q.alt
#> [1] "*:*"
#>
#> $responseHeader$params$indent
#> [1] "true"
#>
#> $responseHeader$params$echoParams
#> [1] "all"
#>
#> $responseHeader$params$fl
#> [1] "*,score"
#>
#> $responseHeader$params$start
#> [1] "0"
#>
#> $responseHeader$params$fq
#> $responseHeader$params$fq[[1]]
#> [1] "type:Dataset"
#>
#> $responseHeader$params$fq[[2]]
#> [1] "project:\"CMIP6\""
#>
#> $responseHeader$params$fq[[3]]
#> [1] "activity_id:\"ScenarioMIP\""
#>
#> $responseHeader$params$fq[[4]]
#> [1] "experiment_id:\"ssp245\""
#>
#> $responseHeader$params$fq[[5]]
#> [1] "source_id:\"MIROC6\""
#>
#> $responseHeader$params$fq[[6]]
#> [1] "variable_id:\"pr\""
#>
#> $responseHeader$params$fq[[7]]
#> [1] "nominal_resolution:\"100km\" || nominal_resolution:\"50km\" || nominal_resolution:\"100 km\" || nominal_resolution:\"50 km\""
#>
#> $responseHeader$params$fq[[8]]
#> [1] "variant_label:\"r1i1p1f1\""
#>
#> $responseHeader$params$fq[[9]]
#> [1] "frequency:\"day\""
#>
#> $responseHeader$params$fq[[10]]
#> [1] "replica:false"
#>
#> $responseHeader$params$fq[[11]]
#> [1] "latest:true"
#>
#>
#> $responseHeader$params$rows
#> [1] "10000"
#>
#> $responseHeader$params$q
#> [1] "*:*"
#>
#> $responseHeader$params$shards
#> [1] "localhost:8983/solr/datasets,localhost:8985/solr/datasets,localhost:8987/solr/datasets,localhost:8988/solr/datasets,localhost:8990/solr/datasets,localhost:8993/solr/datasets,localhost:8994/solr/datasets,localhost:8995/solr/datasets,localhost:8996/solr/datasets,localhost:8997/solr/datasets"
#>
#> $responseHeader$params$tie
#> [1] "0.01"
#>
#> $responseHeader$params$facet.limit
#> [1] "2048"
#>
#> $responseHeader$params$qf
#> [1] "text"
#>
#> $responseHeader$params$facet.method
#> [1] "fc"
#>
#> $responseHeader$params$facet.mincount
#> [1] "1"
#>
#> $responseHeader$params$wt
#> [1] "json"
#>
#> $responseHeader$params$facet.sort
#> [1] "lex"
#>
#>
#>
#> $response
#> $response$numFound
#> [1] 0
#>
#> $response$start
#> [1] 0
#>
#> $response$maxScore
#> [1] 0
#>
#> $response$docs
#> list()
Created on 2021-08-12 by the reprex package (v2.0.0)
This is the expected result
esgf_query() uses a self-implemented URL-encoding approach, which is a bit of a hack. It would be better to take advantage of the utils::URLencode() function instead.
options(epwshiftr.dir = here::here("data/cmip6"))
epwshiftr::load_cmip6_index()
#> Loading CMIP6 experiment output file index created at 2020-09-03 22:38:54.
#> Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, :
#> Incompatible join types: x.datetime_start (double) and i.V1 (character)
The term morphing is used in building simulation, but rarely in other fields. Can you update the documentation in morphing_epw() and explain what you mean by stretching and shifting? Maybe even with some formulas? You can still keep the reference to the paper, but some basic info would be very useful for users.
Otherwise, you could also mention the terms bias adjustment and downscaling, since these are more common in other disciplines.
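For reference, the three operations as usually formulated in the morphing literature (Belcher, Hacker and Powell, 2005) can be sketched on a toy series; here delta is the GCM-predicted absolute change and alpha the relative change for a given calendar month, and the combined form shifts the monthly mean while scaling the variability:

```r
# Morphing operations sketched for one calendar month, following the
# shift/stretch terminology of Belcher et al. (2005):
#   shift:    x = x0 + delta
#   stretch:  x = alpha * x0
#   combined: x = x0 + delta + alpha * (x0 - mean(x0))
shift   <- function(x0, delta) x0 + delta
stretch <- function(x0, alpha) x0 * alpha
morph   <- function(x0, delta, alpha) x0 + delta + alpha * (x0 - mean(x0))

x0 <- c(18, 20, 22)          # toy hourly dry-bulb temperatures
morph(x0, delta = 2, alpha = 1.5)
```

This is only a summary of the standard formulation, not necessarily the exact implementation in morphing_epw().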
Currently, there is no way for the user to flexibly specify where they want to save the CMIP6 output index file. This sometimes becomes cumbersome when making {epwshiftr} work together with {targets}.
The LLNL ESGF node has moved to the new Metagrid UI, which makes get_data_node() fail to parse the data node status. Workarounds include:
Comments from Brian Ripley:
There is a variety of failures here, it seems both in contacting a
website and in the content of that site. We need to remind you of the
CRAN policy
'Packages which use Internet resources should fail gracefully with an
informative message if the resource is not available or has changed (and
not give a check warning nor error).'
so this needs correction whether or not the resource recovers.
The usage examples are often not run and require auxiliary data that is not in the package. Do you think it's possible to add some data to the package so the examples can be run? I know that putting raw GCM data in there is not possible; otherwise the package size would explode. But maybe crop some NetCDF file to a small extent, include an EPW file, and then you could have real examples that can be run? What do you think?
Currently, init_cmip6_index()
only returns the first 10,000 records.
Lines 446 to 452 in 09827d4
It is enough for most use cases, but it would still be good to implement pagination.
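Pagination over a Solr-style endpoint is a loop that raises the offset by the page size until numFound records have been collected. In the sketch below, fetch_page is a stand-in that simulates the server response; it is not an epwshiftr or ESGF function:

```r
# Sketch of offset-based pagination: request pages of `limit` records,
# advancing `offset` until the reported total (`numFound`) is exhausted.
fetch_all <- function(fetch_page, limit = 3L) {
    offset <- 0L
    docs <- list()
    repeat {
        page <- fetch_page(offset, limit)
        docs <- c(docs, page$docs)
        offset <- offset + limit
        if (offset >= page$numFound) break
    }
    docs
}

# fake in-memory "server" holding 8 records, returning Solr-like pages
records <- as.list(1:8)
fake_server <- function(offset, limit)
    list(numFound = length(records),
         docs = records[seq_len(min(limit, length(records) - offset)) + offset])

length(fetch_all(fake_server))  # 8
```

The real implementation would build the `offset=`/`limit=` query parameters into the search URL instead of calling a local function.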
According to the ESGF Search RESTful API, the default behavior is to return all records (masters and replicas). The current implementation always specifies the replica parameter, which means a query returns either only master records or only replicas.
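Restoring the API's default would mean appending the replica filter only when the user asks for one. A minimal sketch, with build_fq() as a hypothetical helper:

```r
# Sketch: make `replica` tri-state. NULL (the default) adds no replica
# clause, so the query returns masters and replicas, matching the
# ESGF API's own default; TRUE/FALSE narrow the results as before.
build_fq <- function(replica = NULL) {
    fq <- c("type:Dataset", 'project:"CMIP6"')
    if (!is.null(replica)) fq <- c(fq, paste0("replica:", tolower(replica)))
    fq
}

build_fq()       # no replica clause -> masters and replicas
build_fq(FALSE)  # adds 'replica:false' -> masters only
```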
Hello Jia!
I am here!
I need an example of the code, as discussed in the email earlier.
Regards
ZZaman
Prepare for release:
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
cran-comments.md
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_dev_version()
Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.
Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.
Brief description of the problem
# insert reprex here
assert_multi_class() was added in {checkmate} version 1.9.0 (2019-01-09). Even though this version was published three years ago, it is still possible that a user has an older version installed. In fact, a user has already reported that {epwshiftr} failed to load due to a lower version of {checkmate}.
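One way to surface this at install time rather than load time would be a minimum-version constraint in DESCRIPTION; a sketch of the relevant stanza:

```
Imports:
    checkmate (>= 1.9.0)
```

With this in place, install.packages() upgrades an outdated {checkmate} automatically instead of letting library(epwshiftr) fail later.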
For instance, for tas_day_AWI-CM-1-1-MR_ssp585_r1i1p1f1_gn_20500101-20501231.nc, the alpha value for November could be -19.95. Logic should be introduced to issue a warning in this case and fall back to the Shift method.
data_mean[, .(lon, lat, dist, epw_mean, gcm_mean = value, delta, alpha)]
#> lon lat dist epw_mean gcm_mean delta alpha
#> <num> <num> <num> <units> <units> <units> <units>
#> 1: 106.4062 35.99986 89.25595 -7.5504032 [°C] -4.804854 [°C] 2.74554916 [°C] 0.6363705 [1]
#> 2: 106.4062 35.99986 89.25595 -3.9571429 [°C] -1.769299 [°C] 2.18784401 [°C] 0.4471152 [1]
#> 3: 106.4062 35.99986 89.25595 1.3489247 [°C] 3.581754 [°C] 2.23282969 [°C] 2.6552663 [1]
#> 4: 106.4062 35.99986 89.25595 8.2494444 [°C] 8.227525 [°C] -0.02191962 [°C] 0.9973429 [1]
#> 5: 106.4062 35.99986 89.25595 13.4138441 [°C] 19.642524 [°C] 6.22867986 [°C] 1.4643471 [1]
#> 6: 106.4062 35.99986 89.25595 16.8897222 [°C] 22.220674 [°C] 5.33095225 [°C] 1.3156329 [1]
#> 7: 106.4062 35.99986 89.25595 19.6094086 [°C] 26.158875 [°C] 6.54946686 [°C] 1.3339961 [1]
#> 8: 106.4062 35.99986 89.25595 18.2104839 [°C] 24.773162 [°C] 6.56267795 [°C] 1.3603791 [1]
#> 9: 106.4062 35.99986 89.25595 13.3270833 [°C] 21.648293 [°C] 8.32121006 [°C] 1.6243834 [1]
#> 10: 106.4062 35.99986 89.25595 7.0000000 [°C] 13.851030 [°C] 6.85103005 [°C] 1.9787186 [1]
#> 11: 106.4062 35.99986 89.25595 -0.2101389 [°C] 4.192878 [°C] 4.40301659 [°C] -19.9528880 [1]
#> 12: 106.4062 35.99986 89.25595 -6.1481183 [°C] -2.254458 [°C] 3.89366038 [°C] 0.3666907 [1]
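The proposed guard could look roughly like the sketch below; the safe_morph() name, the alpha <= 0 threshold, and the warning text are all illustrative assumptions, not epwshiftr code:

```r
# Sketch of the proposed fallback: when the stretch factor alpha for a
# month is non-positive (as with the -19.95 above), warn and apply the
# shift method alone for that month instead of the combined method.
safe_morph <- function(x0, delta, alpha) {
    if (alpha <= 0) {
        warning("Non-positive alpha (", alpha, "); falling back to the Shift method")
        return(x0 + delta)                    # shift only
    }
    x0 + delta + alpha * (x0 - mean(x0))      # combined shift + stretch
}

suppressWarnings(safe_morph(c(-1, 0, 1), delta = 4, alpha = -19.95))
```

A negative alpha arises when the EPW monthly mean and the GCM monthly mean straddle zero, so scaling by it would invert the diurnal profile; shifting avoids that.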