We are using the "ufs_public_release" branches of fv3atm, FMS, and stochastic_physics,

Restarting model changes final result about fv3atm HOT 26 CLOSED

noaa-emc commented on September 28, 2024

Restarting model changes final result

from fv3atm.

Comments (26)

climbfuji commented on September 28, 2024

One thing that comes to my mind is to check the interval for resetting the diagnostic variables (fhzero). Restarts only work if they coincide with a multiple of fhzero. I see that your fhzero value is 0.25, which would be ok, but I never tried using a fractional fhzero value in the first place. This wouldn't explain the differences for the dycore-only run, though.

Which flags are you toggling for the restart runs?
What differs at the end of the restart runs? Just the content of some files or also the checksums that are written out at the end of the integration?
Do I understand correctly that you are not using the ufs-weather-model on top of fv3atm? The "develop" branch of the ufs-weather-model repo has regression tests that exercise the restart capability.
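The fhzero constraint mentioned above (a restart has to fall on a multiple of fhzero) can be checked before launching a run. The following is only a sketch: the function name is hypothetical, and it assumes both values are given in hours. Exact fractions are used so that a fractional interval such as 0.25 is not affected by floating-point rounding.

```python
from fractions import Fraction


def is_multiple_of_fhzero(restart_hour, fhzero):
    """Return True if restart_hour is a whole multiple of fhzero.

    Values are converted through str() so that decimal inputs such as
    "0.25" or 0.25 become exact fractions instead of binary floats.
    """
    interval = Fraction(str(fhzero))
    if interval <= 0:
        raise ValueError("fhzero must be positive")
    return (Fraction(str(restart_hour)) / interval).denominator == 1


print(is_multiple_of_fhzero("4", "0.25"))    # True
print(is_multiple_of_fhzero("0.4", "0.25"))  # False
```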

mcgibbon commented on September 28, 2024

The restart flags are:

    config['namelist']['fv_core_nml']['external_ic'] = False
    config['namelist']['fv_core_nml']['nggps_ic'] = False
    config['namelist']['fv_core_nml']['make_nh'] = False
    config['namelist']['fv_core_nml']['mountain'] = True
    config['namelist']['fv_core_nml']['warm_start'] = True
    config['namelist']['fv_core_nml']['na_init'] = 0

All prognostic and diagnostic variables written out by the restart routines differ.
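As a sketch of how the flags above might be applied in one place, the helper below copies a configuration dictionary and switches it to a warm start. The helper name and the example cold-start dictionary are assumptions for illustration; only the six flag values come from the list above.

```python
import copy

# Values copied from the restart flags listed above.
RESTART_FLAGS = {
    "external_ic": False,
    "nggps_ic": False,
    "make_nh": False,
    "mountain": True,
    "warm_start": True,
    "na_init": 0,
}


def with_restart_flags(config):
    """Return a deep copy of config with the warm-start flags applied."""
    new_config = copy.deepcopy(config)
    core = new_config.setdefault("namelist", {}).setdefault("fv_core_nml", {})
    core.update(RESTART_FLAGS)
    return new_config


# Illustrative cold-start settings, not taken from the thread.
cold = {"namelist": {"fv_core_nml": {"external_ic": True, "warm_start": False}}}
warm = with_restart_flags(cold)
print(warm["namelist"]["fv_core_nml"]["warm_start"])  # True
print(cold["namelist"]["fv_core_nml"]["warm_start"])  # False, the input is not modified
```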

We're using our own docker setup, not the ufs-weather-model repo.

I checked the restart tests (e.g. this one) and my impression is the tests are checking something different than what we've tested. In the ufs-weather-model regression tests, the test checks that when you initialize the model from a restart file and run it, it produces the same output as when this process was done earlier. What we have tested is that when you initialize the model from any condition and run it for a set period continuously, the result should be the same as when you interrupt and restart the run halfway through.

None of the tests in the ufs-weather-model repo seem to do this.
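The check described above (a continuous run and an interrupted-and-restarted run must end in the same state) can be sketched as a byte-level comparison of the final files from both runs. The function names and directory layout are hypothetical. Note that a byte comparison is stricter than comparing fields: files may differ in metadata even when the data agree, in which case a field-by-field comparison would be needed instead.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(continuous_dir, restarted_dir):
    """Return sorted names of files missing from one side or differing in content."""
    left = {p.name: p for p in Path(continuous_dir).iterdir() if p.is_file()}
    right = {p.name: p for p in Path(restarted_dir).iterdir() if p.is_file()}
    mismatched = set(left) ^ set(right)  # present on one side only
    for name in set(left) & set(right):
        if file_digest(left[name]) != file_digest(right[name]):
            mismatched.add(name)
    return sorted(mismatched)
```

An empty list from `differing_files` would mean the two final states are bit-for-bit identical at the file level.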

climbfuji commented on September 28, 2024

NOAA-GSD is testing what you are trying to do for the GSD physics suite: https://github.com/NOAA-GSD/ufs-weather-model (branch gsd/develop). You are right about the way the regression tests are set up for the EMC branches, but as far as I know EMC tests the restart capabilities as you described them from time to time. I last tested it successfully for the GFS suite a few months back.

mcgibbon commented on September 28, 2024

NOAA-GSD is testing what you are trying to do for the GSD physics suite: https://github.com/NOAA-GSD/ufs-weather-model (branch gsd/develop).

Where are those tests defined? As far as I can tell the tests with "restart" in the filename in tests/tests are identical.

climbfuji commented on September 28, 2024

https://github.com/NOAA-GSD/ufs-weather-model/blob/gsd/develop/tests/tests/fv3_ccpp_gsd runs a 48h forecast. https://github.com/NOAA-GSD/ufs-weather-model/blob/gsd/develop/tests/tests/fv3_ccpp_gsd_coldstart runs a 24h forecast from the same initial conditions, and https://github.com/NOAA-GSD/ufs-weather-model/blob/gsd/develop/tests/tests/fv3_ccpp_gsd_warmstart uses the restart files from https://github.com/NOAA-GSD/ufs-weather-model/blob/gsd/develop/tests/tests/fv3_ccpp_gsd_coldstart to run another 24h forecast/warmstart. The results of the warmstart run are compared to the 48h single integration.
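The three-run layout described above can be written down as data, which makes it easy to confirm that the cold-start and warm-start segments together span the same period as the single 48h integration. This is only an illustrative sketch: the `Run` class and `segments_cover_control` function are hypothetical and are not part of the regression test system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    start_hour: int
    length_hours: int
    warm_start: bool

    @property
    def end_hour(self):
        return self.start_hour + self.length_hours


def segments_cover_control(control, segments):
    """Check that the segments run back to back and end where the control run ends."""
    expected_start = control.start_hour
    for segment in segments:
        if segment.start_hour != expected_start:
            return False
        expected_start = segment.end_hour
    return expected_start == control.end_hour


control = Run("fv3_ccpp_gsd", 0, 48, warm_start=False)
coldstart = Run("fv3_ccpp_gsd_coldstart", 0, 24, warm_start=False)
warmstart = Run("fv3_ccpp_gsd_warmstart", 24, 24, warm_start=True)
print(segments_cover_control(control, [coldstart, warmstart]))  # True
```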

mcgibbon commented on September 28, 2024

I do not see similar automated tests for the GFS physics (without ccpp), which we are using (and which is causing the issue). You said

I know EMC tests the restart capabilities as you described them from time to time. Last I tested it successfully for the GFS suite is a few months back.

It would be helpful to get the build and run configuration used so I can test them with our set-up.

climbfuji commented on September 28, 2024

I will see if I find time this week to create those tests. Essentially I'd use a setup similar to the GSD tests described above but with the GFSv15p2 physics.

climbfuji commented on September 28, 2024

Just to give you an update, with the AMS this week I haven't been able to set up these tests yet.

climbfuji commented on September 28, 2024

I can confirm that with the namelist settings in the ufs_public_release branches for the GFS_v15p2 tests, the restarts do not work. I am now trying to fix this; I have a few ideas about what may differ from the tests that we know are b4b reproducible in restart runs.

climbfuji commented on September 28, 2024

See also here for a similar report - ufs-community/ufs-mrweather-app#62

climbfuji commented on September 28, 2024

See here for an update ufs-community/ufs-mrweather-app#62 .

DusanJovic-NOAA commented on September 28, 2024

Please take a look at this branch https://github.com/DusanJovic-NOAA/ufs-weather-model/tree/ctrl_rest.

I modified fv3_restart test slightly to use 12h restart files from a previous fv3_control run and compare 24h outputs against 24h outputs from a fv3_control run. I believe this is the actual restart test that verifies that the restart run is b4b identical to the cold start run. If you have access to one of the supported platforms, clone this branch and run:

    ./rt.sh -l rt_restart.conf -c && ./rt.sh -l rt_restart.conf -m

This is in the develop branch, but I hope it will help you figure out why your tests are not bit-reproducible.

climbfuji commented on September 28, 2024

Ok, I got this figured out. The culprit was the nstf_name setting in the input.nml namelists that we received originally for v15p2 and v16beta:

    nstf_name = 2,1,0,0,0

For restarts, this would have to be

    nstf_name = 2,0,0,0,0

but it doesn't work. I changed this back to the default value in all uncoupled ufs-weather-model regression tests in the develop branch

    nstf_name = 2,1,1,0,5

or, for restarts,

    nstf_name = 2,0,1,0,5

and then I get b4b identical results for both v15p2 and v16beta in PROD mode (cheyenne.intel, hera.intel, cheyenne.gnu).
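In both pairs of values above, the restart setting is the cold-start setting with the second entry changed from 1 to 0. A minimal sketch of that pattern follows; the function name is hypothetical, and the rule is inferred only from the two pairs in this comment, not from the model documentation.

```python
def restart_nstf_name(cold_start_value):
    """Return the nstf_name string for a restart run.

    Assumption: the restart value equals the cold-start value with the
    second of the five entries set to 0, as in the pairs quoted above.
    """
    values = [int(v) for v in cold_start_value.split(",")]
    if len(values) != 5:
        raise ValueError("expected five comma-separated integers")
    values[1] = 0
    return ",".join(str(v) for v in values)


print(restart_nstf_name("2,1,1,0,5"))  # 2,0,1,0,5
print(restart_nstf_name("2,1,0,0,0"))  # 2,0,0,0,0
```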

I created a PR for ufs_public_release here: ufs-community/ufs-weather-model#33.
In this PR, I am also adding the regression tests "properly", i.e. there is a 48h full forecast when the baseline is created, and a 24h coldstart plus a 24h restart run, both verified against the 48h forecast.

climbfuji commented on September 28, 2024

Attached is a tarball with the run directories of the full rt.conf regression test suite for the above PR as it was generated on Cheyenne for the Intel compiler. Discard whatever you don't need and look at the input.nml and model_configure files. These are the only ones that need to be changed. For restart runs, copy/link the contents (files) of the RESTART directory from the previous run into the INPUT directory of the restart run. Hope this helps!
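The staging step described above (copy or link the files in the previous run's RESTART directory into the INPUT directory of the restart run) could be scripted roughly as follows. The function name and arguments are hypothetical; no files are renamed and subdirectories are skipped, so anything beyond a plain copy or link is not covered by this sketch.

```python
import shutil
from pathlib import Path


def stage_restart_files(previous_run_dir, restart_run_dir, link=False):
    """Copy (or symlink) each file in previous_run_dir/RESTART into restart_run_dir/INPUT.

    Existing files with the same name in INPUT are replaced.
    Returns the sorted list of staged file names.
    """
    source = Path(previous_run_dir) / "RESTART"
    target = Path(restart_run_dir) / "INPUT"
    target.mkdir(parents=True, exist_ok=True)
    staged = []
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue
        destination = target / path.name
        if destination.exists() or destination.is_symlink():
            destination.unlink()
        if link:
            destination.symlink_to(path.resolve())
        else:
            shutil.copy2(path, destination)
        staged.append(destination.name)
    return staged
```

Linking saves disk space, while copying keeps the restart run independent of the previous run directory.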

rundirs_cheyenne_intel_config_files.tar.gz

junwang-noaa commented on September 28, 2024

Please take a look at this branch https://github.com/DusanJovic-NOAA/ufs-weather-model/tree/ctrl_rest.

I modified fv3_restart test slightly to use 12h restart files from a previous fv3_control run and compare 24h outputs against 24h outputs from a fv3_control run. I believe this is the actual restart test that verifies that the restart run is b4b identical to the cold start run. If you have access to one of the supported platforms, clone this branch and run:
./rt.sh -l rt_restart.conf -c && ./rt.sh -l rt_restart.conf -m
This is in develop branch but I hope it will help you figure out why your tests are not bit-reproducible.

Thanks for fixing this. We need to commit this change.

climbfuji commented on September 28, 2024

Great, thanks! This is exactly what I have done for GSD's restart test in the past (https://github.com/NOAA-GSD/ufs-weather-model, branch gsd/develop; modulo that I am running 48h forecasts when creating baselines and then two 24h forecasts, a coldstart and a restart, comparing against the 48h runs). I created a PR for ufs_public_release that does the same thing this evening. Yes, please create a PR for "develop" so that we can fix the restart tests.

junwang-noaa commented on September 28, 2024

@mcgibbon is this problem resolved?

mcgibbon commented on September 28, 2024

It is not, thanks for the ping. I haven't had time to test the fixes. I'll give it higher priority so we can ideally close this issue.

mcgibbon commented on September 28, 2024

The nstf_name change did not fix the issue for my installation. The fix someone else encountered of setting a compile flag didn't apply, since it was an intel-only compile option and I'm using the gfortran compiler. I've gone through and made sure each namelist matches with your example as much as possible (considering the input data I have available), still no reproducible run.

If you're curious, the test I have written is here and the install procedure I'm using is here.

I assumed ufs_public_release held a particular stable release of the code? Is there a stable "version" I can acquire somewhere?

climbfuji commented on September 28, 2024

@mcgibbon FYI, the ufs_public_release branches give b4b identical results for restart runs with GNU 8.3.0 on Cheyenne.

mcgibbon commented on September 28, 2024

It's difficult to tell the cause of this difference, because I don't have any way to reproduce the configuration you have on Cheyenne.

climbfuji commented on September 28, 2024

Where are you running? Is there a shared platform that we may be able to use? Do you have access to Amazon Web Services? If everything else fails, I can, time permitting, fire up an instance there, run the restart tests, and give you access to the instance to look at the code and run directories (and download them). You could also upload your setup and we could take a look. But this can't happen this week, I am afraid.

mcgibbon commented on September 28, 2024

If you have a Dockerfile or image somewhere that includes the necessary run data, I could use that. I've just been running locally for development, though we also run jobs on Google Cloud. I got the ufs-weather-model to build in Docker and could upload that to GitHub if you'd like; it would just take some time to make proper forks, because I had to make some small changes to the submodule repos to get it building.

Our stuff is all uploaded in this repo for the fortran model. It makes use of this tool we've made for setting up run directories, which will grab our run data from the web. Assuming docker is installed and you have a python 3 environment, everything should build and the tests should run with the following in the root directory:

    $ make build
    $ pip3 install -r requirements.txt
    $ python3 -m fv3config.download_data
    $ python3 tests/test_fv3_exe_restart.py

The final script spits out the run directories inside the root directory. The pip line will grab a particular commit of our fv3config tool. The download_data command puts data in your system-dependent user cache directory, which you can locate by looking at the symlink target in the resulting run directory. The cases run at C48 for a total of 8 hours, which finishes in a couple minutes on 6 cores on my machine and uses slightly more than 4GB of memory. You may need to increase the default docker memory allocation (I think I have mine set to 8GB).

If the test fails with a message about a missing output file or x not in list, you should read the fv3err file (standard error log), because it means the model run failed.

climbfuji commented on September 28, 2024

@mcgibbon we addressed the restart reproducibility issues in the authoritative branches (ufs-weather-model develop) a while ago. Can you confirm it is working correctly now? Thanks.

mcgibbon commented on September 28, 2024

That's great to hear, both that you ran into these issues (the last I heard was that the tests were passing in the UFS model), and that it's fixed. Can you give me any more detail about where this came up in ufs-weather-model?

We don't currently have our code set up to use the ufs-weather-model build system, so I can't quickly update the sources and run the test. This is something we're planning to do soon, so I'll re-run this test when we do so. For now, I'll close this issue.

climbfuji commented on September 28, 2024

The two commits for fv3atm (develop) are listed in ufs-community/ufs-weather-model#208 and ufs-community/ufs-weather-model#325. You can also look at just the required changes for the release/public-v2 branches (for the upcoming SRW App release), which went in today (again in fv3atm mentioned here: ufs-community/ufs-weather-model#417).
