radical-collaboration / facts Goto Github PK

Repository for the Framework for Accessing Changes To Sea-level (FACTS)

License: MIT License

Python 35.52% R 6.29% Shell 0.11% Jupyter Notebook 57.99% Dockerfile 0.09%

facts's Introduction

Framework for Assessing Changes To Sea-level (FACTS)

The Framework for Assessing Changes To Sea-level (FACTS) is an open-source modular, scalable, and extensive framework for global mean, regional, and extreme sea level projection that is designed to support the characterization of ambiguity in sea-level projections. It is designed so users can easily explore deep uncertainty by investigating the implications on GMSL, RSL, and ESL of different choices for different processes. Its modularity allows components to be represented by either simple or complex model. Because it is built upon the Radical-PILOT computing stack, different modules can be dispatched for execution on resources appropriate to their computational complexity.

FACTS is being developed by the Earth System Science & Policy Lab and the RADICAL Research Group at Rutgers University. FACTS is released under the MIT License.

See fact-sealevel.readthedocs.io for documentation.

For model description, see Kopp, R. E., Garner, G. G., Hermans, T. H. J., Jha, S., Kumar, P., Reedy, A., Slangen, A. B. A., Turilli, M., Edwards, T. L., Gregory, J. M., Koubbe, G., Levermann, A., Merzky, A., Nowicki, S., Palmer, M. D., & Smith, C. (2023). The Framework for Assessing Changes To Sea-Level (FACTS) v1.0: A platform for characterizing parametric and structural uncertainty in future global, relative, and extreme sea-level change. Geoscientific Model Development, 16, 7461–7489.

facts's People

Contributors

Stargazers

Watchers

Forkers

karahbit victor-malagon benjaminharrisonmo jkrasting gomeskingsley pkjr002 jetesdal kdorheim bryanparthum timh37

facts's Issues

Update EnTK and fix namespace for compatibility

EnTK needs to be updated and module namespace adjusted so that FACTS.py runs properly.

Deposit deployable version of FACTS 1.0 and associated data sets in FAIR-aligned repo (e.g., Zenodo)

Develop user, module developer, and FACTS developer documentation.

Code base needs documentation.

Create initial pipeline with placeholders for tasks from the 4 stages

Use xarray (w/dask) in workflow

Many geoscientific datasets (such as the CMIP6 archive) are in netCDF format. It's likely one would need to open multiple netCDF files within their workflow, potentially loading 10s to 100s of GB of data into system memory.

The xarray Python module addresses this potential problem, along with some quality-of-life improvements to the way these datasets are handled and analyzed. The xarray module leverages the dask Python module to "chunk" the datasets into smaller pieces and perform operations on these smaller chunks. No operations are performed until the values need to be calculated (lazy xarray vs. aggressive numpy). This means, less system memory will be used to perform the same operations. The xarray module is gaining popularity among geoscientists and it's very likely that a FACTS module may need it. Which begs the question...

Would EnTK be able to handle a workflow that uses the xarray Python module (with dask)?

Propose data staging approach to kopp14

The requirement is to save intermediate data to a dedicate data directory

Draft model description manuscript

FACTS should be described in a manuscript submitted to an appropriate peer-reviewed journal (e.g., Geoscientific Model Development)

Remove data from sub-model modules

Remove the data files from the sub-model modules. Must keep github happy!

Define location for radical.pilot.sandbox when using localhost?

Sorry if this is a dumb question, but is there a way to define the location where "radical.pilot.sandbox" should be set up when running on localhost? Right now the default is the user's home directory.

The problem I'm running into is that I have limited space on the partition that hosts my home directory. Transferring large files through EnTK to this sandbox fills my partition completely and causes problems. By setting the location for the sandbox, I may be able to avoid this.

Any help is much appreciated!

Add pointers to data sets needed for modules

The current modules rely on data sets (and, in some cases, third party code, e.g. emulandice) not included in the repo. Directions should be provided to obtain these data sets.

Allow automated handoff of temperature projections from FAIR module to sea-level process models

It should be possible to run a FACTS workflow that starts with emissions trajectory and turns these into temperature trajectories for direct input into sea level modules.

Implement test routines for all modules

Now that we have a fairly standardizing module testing system, this should be applied to all modules.

Run coupled experiments with new emissions scenarios driven by fair

Should be able to demonstrate the AR6 medium-confidence methods with new emissions scenarios (e.g., RFF-SPs)

kopp14/oceandynamics test script fails

Traceback (most recent call last):
  File "kopp14_preprocess_oceandynamics.py", line 184, in <module>
    kopp14_preprocess_oceandynamics(args.scenario, args.zostoga_model_dir, args.zos_model_dir, not args.no_drift_corr, args.baseyear, args.pyear_start, args.pyear_end, args.pyear_step, args.locationfile, args.pipeline_id)
  File "kopp14_preprocess_oceandynamics.py", line 120, in kopp14_preprocess_oceandynamics
    sZOS = np.apply_along_axis(nanSmooth, axis=0, arr=ZOS, w=smoothwin)
  File "<__array_function__ internals>", line 6, in apply_along_axis
  File "/home/rk509/.conda/envs/conda-entk/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 402, in apply_along_axis
    buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
  File "kopp14_preprocess_oceandynamics.py", line 117, in nanSmooth
    temp[idx] = Smooth(x[idx], w)
ValueError: shape mismatch: value array of shape (37,) could not be broadcast to indexing result of shape (1,)

Is there an easy way to parse pipeline.yml into a shell script?

For many modules, a straightforward way of testing is to run on local resources at a limited number of space-time points without the EnTK overhead and comparing to target values. Greg effectively did this for the IPCC AR6 execution by manually creating shell scripts (and manually comparing to target values), but it seems like there should be a straightforward way of generating such scripts from pipeline.yml. Does such a method exist, or could one of the RADICAL team help with its creation?

Porting to Python 3

ipccar6/gmipemuglaciers pipeline.yml has wrong filenames

EnTK App Manager hangs when processing second workflow

I submit a list of pipelines (SLR contributors) to the app manager and run the app manager.

# Initialize the EnTK App Manager
amgr = AppManager(hostname=rcfg['rabbitmq']['hostname'], port=rcfg['rabbitmq']['port'])
	
# Apply the resource configuration provided by the user
res_desc = {'resource': rcfg['resource-desc']['name'],
	'walltime': rcfg['resource-desc']['walltime'],
	'cpus': rcfg['resource-desc']['cpus'],
	'queue': rcfg['resource-desc']['queue'],
	'project': rcfg['resource-desc']['project']}
amgr.resource_desc = res_desc
	
# Assign the list of pipelines to the workflow
amgr.workflow = pipelines
		
# Run the workflow
amgr.run()

This executes correctly (pipelines and rcfg have already been defined). I then assign a new list of pipelines to the same app manager and attempt to run the workflow.

# New pipeline
p1 = Pipeline()
p1.name = "Test-pipeline"
	
# First stage with one task
s1 = Stage()
s1.name = "Test-stage"
t1 = Task()
t1.name = "Test-task"
t1.executable = '/bin/sleep'
t1.arguments = ['1']
	
# Second stage with one task
s2 = Stage()
s2.name = "Test-stage2"
t2 = Task()
t2.name = "Test-task2"
t2.executable = '/bin/sleep'
t2.arguments = ['5']
	
# Assign tasks and stages to pipeline
s1.add_tasks(t1)
s2.add_tasks(t2)
p1.add_stages(s1)
p1.add_stages(s2)
	
# Assign the pipeline to the workflow and run
amgr.workflow = [p1]
amgr.run()

The workflow hangs, seemingly indefinitely, after submitting the first task/stage/pipeline. I need to send a KeyboardInterrupt to stop execution. The error seems to be related to checking for a heartbeat.

Traceback (most recent call last):
  File "/home/ggg46/slr_venv3/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 195, in _tmgr
    raise EnTKError(e)
radical.entk.exceptions.EnTKError: (404, "NOT_FOUND - no queue 're.session.ggg46-vb.ggg46.018393.0004-hb-request' in vhost '/'")

Am I doing something wrong, or is this a potential bug in EnTK? Attached are the log files from this session.
verbose.log
re_logs.zip

PR for structuring the repository as a Python module

Add DeConto and Pollard 2021 to emulandice to allow dependence of DP21 results on temperature

(Note this is a module feature request rather than a FACTS core functionality request; Tamsin and Rob should be involved)

Tasks fail - unable to locate python modules

I run, within a virtual environment, "python FACTS.py ./experiments/temp_exp/" (from the restructure_facts branch). Everything seems to go ok until the individual tasks are executed. Each task is returned with a "fail" status. The kopp14-icesheets pre-process stage (one task) produces error output indicating that numpy cannot be found. Running the associated code outside of FACTS works as expected. It seems like there's a mismatch in Python environments between running code inside versus outside EnTK.

Document post-1.0 thoughts on general developments, conversion to library, development of GISS modules, and hindcasts

MongoDB down?

I’m trying to run a FACTS workflow, but it fails trying to create the session. Is there something wrong with the server running mongodb? Error is below:

(facts_env) greg@greg-VirtualBox:~/research/facts/re.session.greg-VirtualBox.greg.018514.0001$ more radical.entk.resource_manager.0000.log 
1599679333.178 : radical.entk.resource_manager.0000 : 10835 : 140535638288192 : ERROR    : Resource request submission failed
Traceback (most recent call last):
  File "/home/greg/facts_env/lib/python3.8/site-packages/radical/pilot/session.py", line 177, in _initialize_primary
    self._dbs = DBSession(sid=self.uid, dburl=dburl,
  File "/home/greg/facts_env/lib/python3.8/site-packages/radical/pilot/db/database.py", line 50, in __init__
    self._mongo, self._db, _, _, _ = ru.mongodb_connect(str(dburl))
  File "/home/greg/facts_env/lib/python3.8/site-packages/radical/utils/misc.py", line 120, in mongodb_connect
    db.authenticate(user, pwd)
  File "/home/greg/facts_env/lib/python3.8/site-packages/pymongo/database.py", line 1492, in authenticate
    self.client._cache_credentials(
  File "/home/greg/facts_env/lib/python3.8/site-packages/pymongo/mongo_client.py", line 775, in _cache_credentials
    server = self._get_topology().select_server(
  File "/home/greg/facts_env/lib/python3.8/site-packages/pymongo/topology.py", line 241, in select_server
    return random.choice(self.select_servers(selector,
  File "/home/greg/facts_env/lib/python3.8/site-packages/pymongo/topology.py", line 199, in select_servers
    server_descriptions = self._select_servers_loop(
  File "/home/greg/facts_env/lib/python3.8/site-packages/pymongo/topology.py", line 215, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: 129.114.17.185:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 5f592b47b9c5ff5c7873
11a6, topology_type: Single, servers: [<ServerDescription ('129.114.17.185', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('129.114.17.185:2701
7: timed out')>]>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/greg/facts_env/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 147, in _submit_resource_request
    self._session = rp.Session(uid=self._sid)
  File "/home/greg/facts_env/lib/python3.8/site-packages/radical/pilot/session.py", line 148, in __init__
    self._initialize_primary(dburl)
  File "/home/greg/facts_env/lib/python3.8/site-packages/radical/pilot/session.py", line 192, in _initialize_primary
    raise RuntimeError ('session create failed [%s]' %
RuntimeError: session create failed [mongodb://facts:****@129.114.17.185/facts]

Validate global and NYC data produced by modules in test scripts

Tracking spreadsheet at https://docs.google.com/spreadsheets/d/1ByCHmMuGtm-audVhm1ACbPMFGNLBbrRexzFe902A4L4/edit?usp=sharing

Benchmark framework performance

tlm/oceandynamics has no pipeline.yml file

Develop alternative to FACTS.py that allows modules to be called from Jupyter notebook

Many users will prefer a notebook-based interaction to tinkering with experiment files.

"KeyboardInterrupt" detected when submitting pilot

I'm trying to run a workflow through the current version of FACTS (devel branch; workflow configuration is included there). The code runs as expected until it tries to submit the pilot. For some reason, it detects a "keyboardInterrupt" somewhere and closes out of everything. I can assure you that I'm not pressing anything on the keyboard while the code is running.

Below is my call to FACTS and the resulting STDOUT:

(slr_venv3) ggg46@ggg46-vb:~/research/slr_framework/code/facts$ python3 FACTS.py experiments/temp_exp
EnTK session: re.session.ggg46-vb.ggg46.018477.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.ggg46-vb.ggg46.018477.0001]                           \
database   : [mongodb://facts:[email protected]/facts]          ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   local.localhost           2 cores       0 gpus           ok
closing session re.session.ggg46-vb.ggg46.018477.0001                          \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 19.1s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
All components terminated
Traceback (most recent call last):
  File "/home/ggg46/slr_venv3/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/home/ggg46/slr_venv3/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 536, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ggg46/slr_venv3/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 416, in run
    self._rmgr._submit_resource_request()
  File "/home/ggg46/slr_venv3/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 192, in _submit_resource_request
    raise KeyboardInterrupt
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "FACTS.py", line 432, in <module>
    run_experiment(args.edir, args.debug, args.no_total)
  File "FACTS.py", line 372, in run_experiment
    amgr.run()
  File "/home/ggg46/slr_venv3/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 441, in run
    raise KeyboardInterrupt
KeyboardInterrupt

Below is the result of radical-stack:

(slr_venv3) ggg46@ggg46-vb:~/research/slr_framework/code/facts$ radical-stack

  python               : 3.7.5
  pythonpath           : /home/ggg46/pylibs/ssht
  virtualenv           : /home/ggg46/slr_venv3

  radical.entk         : 1.4.1.post1
  radical.pilot        : 1.4.1
  radical.saga         : 1.4.0
  radical.utils        : 1.4.0

Everything was running as expected from start to finish as recently as 29 July (Wednesday) afternoon. I first noticed the problem on 31 July (Friday). The only change I made was to the workflow configuration file to add additional (tested and working) modules to the workflow. I've also reverted the configuration file back to what it was on 29 July, but the problem persists.

Any insight would be greatly appreciated.

Add linear emulator of Bamber et al 2019 results to estimate behavior in between 2 C and 5 C elicited scenarios

Support for Amarel

radical-cybertools/radical.pilot#1813

Check if LARMIP module includes AR6-style SMB adjustment

Validate runthrough of AR6 workflows invoked through FACTS.py

Validate for GMSL and a subset of RSL runs that pre-release FACTS can replicate the results of the AR6 workflow configurations

Start using Sphinx for documentation

https://www.sphinx-doc.org/en/master/

PR with discussed command line tool

ar5/glacierscmip6, ar5/glaciersfair, ar5/thermalexpansion test scripts fail

Traceback (most recent call last):
  File "ar5_project_glacierscmip6.py", line 244, in <module>
    ar5_project_glacierscmip6(args.seed, args.pyear_start, args.pyear_end, args.pyear_step, args.nmsamps, args.ntsamps, args.nsamps, args.pipeline_id)
  File "ar5_project_glacierscmip6.py", line 167, in ar5_project_glacierscmip6
    year_var[:] = targyears
  File "src/netCDF4/_netCDF4.pyx", line 4903, in netCDF4._netCDF4.Variable.__setitem__
  File "/home/rk509/.conda/envs/conda-entk/lib/python3.7/site-packages/netCDF4/utils.py", line 356, in _StartCountStride
    datashape = broadcasted_shape(shape, datashape)
  File "/home/rk509/.conda/envs/conda-entk/lib/python3.7/site-packages/netCDF4/utils.py", line 964, in broadcasted_shape
    return np.broadcast(a, b).shape
ValueError: shape mismatch: objects cannot be broadcast to a single shape

Seeking a basic usage flow

After seeing this repo mentioned in one of Bob Kopp's presentations, I went to try it. Here are some notes on how far I got, which might be helpful in the documentation process. What I am looking for is basic instructions on running the FACTS machinery according to the configurations used in AR6.

First, seeing the setup.py file, I tried to install facts. The installation ends with the error:

copying other/FACTS_poster_20190310.pdf -> build/bdist.linux-x86_64/egg/share/facts/other
error: can't copy 'other/test_xarray': doesn't exist or not a regular file

Seeing that other just has a unit test and a poster, I dropped that line from setup.py. Then it installs.

Next, running FACTS.py, I get the error that I'm missing radical.entk. I'm guessing that I need radical.entk==0.72.1, for python2. So I installed that too.

Then running FACTS.py again with the temp_exp experiment, I get the error

pika.exceptions.ConnectionClosed: Connection to 127.0.0.1:5672 failed: [Errno 111] Connection refused

What do I need to have running on that port?

LARMIP/icesheet module uses different file format for global temperature than outputted by FAIR/temperature module

Temperature input should be adjusted to align with emulandice and fair. Should be fairly straightforward to adjust the LARMIP preprocessing stage
to use the GetSamples function from emulandice preprocessing state.

Rename repository to FACTS

Starting to look into documenting the module

Sphinx: http://www.sphinx-doc.org/en/master/
readthedocs: https://readthedocs.org/

tlm/oceandynamics module test fails

tlm_fit_oceandynamics.py:209: RuntimeWarning: invalid value encountered in true_divide
  OceanDynTECorr = corr_num / corr_denom
Traceback (most recent call last):
  File "tlm_postprocess_oceandynamics.py", line 265, in <module>
    tlm_postprocess_oceandynamics(args.nsamps, args.seed, args.chunksize, args.keep_temp, args.pipeline_id)
  File "tlm_postprocess_oceandynamics.py", line 186, in tlm_postprocess_oceandynamics
    combined = xr.open_mfdataset("{0}_tempsamps_*.nc".format(pipeline_id), concat_dim="locations", chunks={"locations":chunksize})
  File "/home/rk509/.conda/envs/conda-entk/lib/python3.7/site-packages/xarray/backends/api.py", line 889, in open_mfdataset
    "When combine='by_coords', passing a value for `concat_dim` has no "
ValueError: When combine='by_coords', passing a value for `concat_dim` has no effect. To manually combine along a specific dimension you should instead specify combine='nested' along with a value for `concat_dim`.
mv: cannot stat ‘*localsl*’: No such file or directory

EnTK Documentation/Questions

In developing the "totaling" and "extreme sea-level" portions of the framework, I've come up with a list of questions about EnTK and how it works. Below are the questions:

In section 6.1 in the EnTK documentation, the use of a stage's "post_exec" method is ambiguous. The documentation here and in the Stage API (section 7.1.2) indicate the use of a dictionary whereas the example illustrates the use of a function. Which is the appropriate usage?
Does a shared directory for the EnTK session exist that I can have individual tasks write to, even if it's not setup within the application manager?
The description of the "copy_output_data" in the Task API (7.1.3) is ambiguous. This method should copy task output files to a new location on the remote machine within the same EnTK session. I would expect this to require information about both the file to copy and where to copy it. The syntax for "download_output_data" looks closer to what I would expect for this method. What's the proper syntax for the "copy_output_data" usage? Perhaps an example would be useful?

Define a first set of unit tests

Information about the sea rise project

Test script for module ipccar6/larmipicesheet does not complete

Update PR with repository restructuring

Can EnTK/RP be supported on Windows?

Following RP ticket radical-cybertools/radical.pilot#1812

Use of Task.cpu_reqs in EnTK

TL;DR - How does setting "processes" to anything greater than one in Task.cpu_reqs affect available memory for the task in EnTK?

This may be a dumb question (and please don’t hesitate to point that out if it’s the case), but I’m a little confused by Task.cpu_reqs and how it relates to the resources provided to the application manager through the resource description AppManager.resource_desc.

My initial impression was that defining the number of CPUs in the application manager sets the total number of CPUs available for the entire workflow. Task.cpu_reqs would set the number of CPUs required for the particular task. My confusion stems from running my test_xarray workflow in EnTK on Rutger's Amarel resources. It apparently doesn’t matter how many CPUs I define in the application manager (as long as they’re available on the remote machine), the code runs fine. I then try to assign multiple CPUs to each task in the workflow through Task.cpu_reqs:

Task.cpu_reqs = {"processes": 4, "process_type": None, "threads_per_process": 1, "thread_type": None}

…the tasks fail to run due to “Insufficient Resources”. I make sure that the product of “processes”, “threads_per_process”, and number of tasks is less than or equal to the number of CPUs provided to the application manager, so the only other resource I can think of that could be insufficient is memory. If Task.cpu_reqs behaves how I think it does, it shouldn’t impact the amount of memory available to the particular task…which means my intuition about Task.cpu_reqs is incorrect.

One potential reason for the insufficient resource error I can think of would be if setting “processes” in Task.cpu_reqs spawns some sort of distributed memory situation (similar to MPI) where (in the example above) four instances of the task are run concurrently each with its own memory allocation. If that’s the case, then I can definitely see where the insufficient resources error would be coming from.

Sorry for the rambling, but any insight you all can provide into this would be greatly appreciated!

MongoDB issue

From Greg's email:

"I just installed a new VM and decided to setup EnTK from scratch. All goes well until it comes time to register and connect to a mongodb service. The one you use in your tutorial (mlabs.com) is no longer accepting new user accounts. mLab is now part of MongoDB Atlas. So I created an account, set up a free tier cluster, and associated a user with the database.

The problem comes in connecting to the database. I get the connection string for the python driver, assign it to RADICAL_PILOT_DBURL, and run the "get_started.py" sample script."

Validate emulandice and larmip modules against offline versions used for AR6.

The temperature-driven emulandice and larmip modules were developed by Greg after AR6; for AR6, these emulators were run offline and results were used by direct sampling (see ipccar6/gmipemuglaciers, ipccar6/ismipemuicesheet, ipccar6/larmipicesheet modules). We need to verify that the online emulandice and larmip modules produce the same results.

Versions of some modules used in AR6 aren't in repo

In particular, the direct sample modules "bambericesheet", "gmpemuglaciers", "ismipemuicesheet" and "larmipicesheet" aren't here. While these may be replaced by the directsample module (and more sophisticated implementations of larmip and emulandice), I think we do want traceability to the AR6 implementation, even if these are deprecated.