
Modulus-Makani: Massively parallel training of machine-learning based weather and climate models

Overview | Getting started | More information | Contributing | Further reading | References


Makani (the Hawaiian word for wind 🍃🌺) is an experimental library designed to enable the research and development of machine-learning based weather and climate models in PyTorch. Makani is used for ongoing research. Stable features are regularly ported to NVIDIA Modulus, a framework for training Physics-ML models in Science and Engineering.

Overview

Makani was started by engineers and researchers at NVIDIA and NERSC to train FourCastNet, a deep-learning based weather prediction model.

Makani is a research code built for massively parallel training of weather and climate prediction models on 100+ GPUs and to enable the development of the next generation of weather and climate models. Among others, Makani was used to train Spherical Fourier Neural Operators (SFNO) [1] and Adaptive Fourier Neural Operators (AFNO) [2] on the ERA5 dataset. Makani is written in PyTorch and supports various forms of model- and data-parallelism, asynchronous loading of data, unpredicted channels, autoregressive training and much more.

Getting started

Training:

Training is launched by calling train.py and passing it the necessary CLI arguments to specify the configuration file --yaml_config and the configuration target --config:

mpirun -np 8 --allow-run-as-root python -u train.py --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2"

⚠️ Architectures with complex-valued weights: training some architectures with complex-valued weights requires patches to PyTorch that have not yet been released. These runs will currently fail; a patch will follow soon.

Makani supports various optimizations to fit large models into GPU memory and to enable computationally efficient training. An overview of these features and the corresponding CLI arguments is provided in the following table:

| Feature | CLI argument | Options |
|---|---|---|
| Automatic Mixed Precision | --amp_mode | none, fp16, bf16 |
| Just-in-time compilation | --jit_mode | none, script, inductor |
| CUDA graphs | --cuda_graph_mode | none, fwdbwd, step |
| Activation checkpointing | --checkpointing_level | 0, 1, 2, 3 |
| Data parallelism | --batch_size | 1, 2, 3, ... |
| Channel parallelism | --fin_parallel_size, --fout_parallel_size | 1, 2, 3, ... |
| Spatial model parallelism | --h_parallel_size, --w_parallel_size | 1, 2, 3, ... |
| Multistep training | --multistep_count | 1, 2, 3, ... |

Larger models in particular are enabled by combining these techniques. Spatial model parallelism splits both the model and the data across multiple GPUs, reducing both the memory footprint of the model and the I/O load, since each rank only needs to read a fraction of the data. A typical "large" training run of SFNO can be launched by running

mpirun -np 256 --allow-run-as-root python -u train.py --amp_mode=bf16 --cuda_graph_mode=fwdbwd --multistep_count=1 --run_num="ngpu256_sp4" --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" --h_parallel_size=4 --w_parallel_size=1 --batch_size=64

Here we train the model on 256 GPUs, with the spatial domain split across 4 ranks and a global batch size of 64, which amounts to a local batch size of 1/4 of a sample per GPU. Memory requirements are further reduced by using bf16 automatic mixed precision.
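
As a back-of-the-envelope check of these numbers, the decomposition works out as follows. This is a minimal sketch; the variable names are ours and not Makani options:

# Illustrative arithmetic for the run above; variable names are ours, not Makani CLI options.
n_gpus = 256                  # total ranks launched via mpirun -np 256
h_parallel_size = 4           # spatial split set with --h_parallel_size
w_parallel_size = 1           # spatial split set with --w_parallel_size
global_batch_size = 64        # --batch_size

model_parallel_size = h_parallel_size * w_parallel_size        # GPUs cooperating on one sample: 4
data_parallel_size = n_gpus // model_parallel_size             # independent model replicas: 64
samples_per_replica = global_batch_size / data_parallel_size   # 1.0 sample per replica
share_per_gpu = samples_per_replica / model_parallel_size      # 0.25, i.e. the "local batch size of 1/4"

print(data_parallel_size, samples_per_replica, share_per_gpu)  # 64 1.0 0.25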

Inference:

Inference is launched in a similar fashion to training: call inference.py from the CLI with the same arguments; the run itself is handled by inferencer.py. To launch inference on the out-of-sample dataset, we can call:

mpirun -np 256 --allow-run-as-root python -u inference.py --amp_mode=bf16 --cuda_graph_mode=fwdbwd --multistep_count=1 --run_num="ngpu256_sp4" --yaml_config="config/sfnonet.yaml" --config="sfno_linear_73chq_sc3_layers8_edim384_asgl2" --h_parallel_size=4 --w_parallel_size=1 --batch_size=64

By default, the inference script will perform inference on the out-of-sample dataset specified in the configuration file.

More about Makani

Project structure

The project is structured as follows:

makani
├── ...
├── config                  # configuration files, also known as recipes
├── data_process            # data pre-processing such as computation of statistics
├── datasets                # dataset utility scripts
├── docker                  # scripts for building a docker image for training
├── inference               # contains the inferencer
├── mpu                     # utilities for model parallelism
├── networks                # networks, contains definitions of various ML models
├── tests                   # test files
├── third_party/climt       # third party modules
│   └── zenith_angle.py     # computation of zenith angle
├── utils                   # utilities
│   ├── dataloaders         # contains various dataloaders
│   ├── metrics             # metrics folder contains routines for scoring and benchmarking.
│   ├── ...
│   ├── comm.py             # comms module for orthogonal communicator infrastructure
│   ├── dataloader.py       # dataloader interface
│   ├── metric.py           # centralized metrics handler
│   ├── trainer_profile.py  # copy of trainer.py used for profiling
│   └── trainer.py          # main file for handling training
├── ...
├── inference.py            # CLI script for launching inference
├── train.py                # CLI script for launching training
└── README.md               # this file

Model and Training configuration

Model training in Makani is specified through the use of .yaml files located in the config folder. The corresponding models are located in networks and registered in the get_model routine in networks/models.py. The following table lists the most important configuration options.

| Configuration Key | Description | Options |
|---|---|---|
| nettype | Network architecture. | sfno, fno, afno, vit |
| loss | Loss function. | l2, geometric l2, ... |
| optimizer | Optimizer to be used. | Adam, AdamW |
| lr | Initial learning rate. | float > 0.0 |
| batch_size | Batch size. | integer > 0 |
| max_epochs | Number of epochs to train for. | integer |
| scheduler | Learning rate scheduler to be used. | None, CosineAnnealing, ReduceLROnPlateau, StepLR |
| lr_warmup_steps | Number of warmup steps for the learning rate scheduler. | integer >= 0 |
| weight_decay | Weight decay. | float |
| train_data_path | Directory path which contains the training data. | string |
| test_data_path | Directory path which contains the test data. | string |
| exp_dir | Directory path for outputs such as model checkpoints. | string |
| metadata_json_path | Path to the metadata file data.json. | string |
| channel_names | Channels to be used for training. | List[string] |

For a more comprehensive overview, we suggest looking into existing .yaml configurations. More details about the available configurations can be found in this file.
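
To make the table concrete, here is a rough sketch of what such a configuration target could look like and how its keys can be read back. The recipe name and all values below are illustrative assumptions, not one of the shipped configurations (requires PyYAML):

# Minimal sketch: parse a hypothetical config target and read the keys from the table above.
# The recipe name and values are illustrative assumptions, not an official Makani recipe.
import yaml

example = """
sfno_example:
  nettype: sfno
  loss: geometric l2
  optimizer: Adam
  lr: 0.001
  batch_size: 64
  max_epochs: 70
  scheduler: CosineAnnealing
  lr_warmup_steps: 0
  weight_decay: 0.0
  train_data_path: /data/train
  test_data_path: /data/test
  exp_dir: /runs/sfno_example
  metadata_json_path: /data/metadata/data.json
  channel_names: [t2m, u10, v10]
"""

config = yaml.safe_load(example)["sfno_example"]
print(config["nettype"], config["lr"], config["channel_names"])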

Training data

Makani expects the training/test data in HDF5 format, where each file contains the data for an entire year. The dataloaders in Makani then load the input inp and the target tar, which correspond to the state of the atmosphere at a given point in time and at a later time, respectively. The time difference between input and target is controlled by the parameter dt, which specifies how many dataset steps apart the two are. The physical time difference is determined by the temporal resolution dhours of the dataset.
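
For illustration, the sketch below shows how an inp/tar pair separated by dt steps maps onto such a yearly HDF5 file. This is not Makani's dataloader code; the file name and the HDF5 variable name are assumptions (the latter is taken from the metadata example below):

# Illustrative sketch of the inp/tar convention, not Makani's actual dataloader.
# The file name "1990.h5" and the variable name "fields" are assumptions.
import h5py

dt = 1        # number of dataset steps between input and target
dhours = 6    # temporal resolution of the dataset in hours

with h5py.File("1990.h5", "r") as f:
    fields = f["fields"]      # shape: (time, channel, lat, lon)
    idx = 0
    inp = fields[idx]         # atmospheric state at step idx
    tar = fields[idx + dt]    # state dt steps, i.e. dt * dhours hours, later

print(inp.shape, tar.shape)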

Makani requires a metadata file named data.json, which describes important properties of the dataset, such as the name of the HDF5 variable that contains the data. Another example is the set of channels to load in the dataloader, which are specified via channel names. The metadata file has the following structure:

{
    "dataset_name": "give this dataset a name",     # name of the dataset
    "attrs": {                                      # optional attributes, can contain anything you want
        "description": "description of the dataset",
        "location": "location of your dataset"
    },
    "h5_path": "fields",                            # variable name of the data inside the hdf5 file
    "dims": ["time", "channel", "lat", "lon"],      # dimensions of fields contained in the dataset
    "dhours": 6,                                    # temporal resolution in hours
    "coord": {                                      # coordinates and channel descriptions
        "grid_type": "equiangular",                 # type of grid used in dataset: currently supported choices are 'equiangular' and 'legendre-gauss'
        "lat": [0.0, 0.1, ...],                     # latitudinal grid coordinates
        "lon": [0.0, 0.1, ...],                     # longitudinal grid coordinates
        "channel": ["t2m", "u10", "v10", ...]       # names of the channels contained in the dataset
    }
}
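
Such a metadata file can be inspected with standard tooling. The short sketch below (the path "data.json" is an assumption) reads back the channel names, grid type and temporal resolution:

# Minimal sketch: read the metadata file and extract a few of the fields described above.
# The path "data.json" is an assumption; point it at your dataset's metadata file.
import json

with open("data.json", "r") as f:
    metadata = json.load(f)

channels = metadata["coord"]["channel"]     # e.g. ["t2m", "u10", "v10", ...]
grid_type = metadata["coord"]["grid_type"]  # "equiangular" or "legendre-gauss"
dhours = metadata["dhours"]                 # temporal resolution in hours

print(f"{len(channels)} channels on a {grid_type} grid at {dhours}h resolution")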

The ERA5 dataset can be downloaded here.

Model packages

By default, Makani saves a model package when training starts. Model packages contain all the data necessary to run the model: the statistics used to normalize inputs and outputs, unpredicted static channels, and even the code which appends celestial features such as the cosine of the solar zenith angle. Read more about model packages here.
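
As a generic illustration of one ingredient of a package (this is not Makani's package API, and the statistics file names are assumptions), input normalization with stored per-channel statistics amounts to:

# Generic illustration of normalizing inputs with stored per-channel statistics.
# Not Makani's package API; the .npy file names are assumptions.
import numpy as np

mean = np.load("global_means.npy")   # per-channel means, e.g. shape (1, C, 1, 1)
std = np.load("global_stds.npy")     # per-channel standard deviations, same shape

x = np.random.randn(1, mean.shape[1], 721, 1440).astype(np.float32)  # dummy input fields
x_normalized = (x - mean) / std      # what the model actually sees as input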

Contributing

Thanks for your interest in contributing. There are many ways to contribute to this project.

  • If you find a bug, let us know and open an issue. Even better, if you feel like fixing it and opening a pull request, we are incredibly grateful for that. 🙏
  • If you feel like adding a feature, we encourage you to discuss it with us first, so we can guide you on how to best achieve it.

While this is a research project, we aim to have functional unit tests with decent coverage. We kindly ask you to implement unit tests if you add a new feature and it can be tested.

Further reading

Authors

The code was developed by Thorsten Kurth, Boris Bonev, Jean Kossaifi, Animashree Anandkumar, Kamyar Azizzadenesheli, Noah Brenowitz, Ashesh Chattopadhyay, Yair Cohen, David Hall, Peter Harrington, Pedram Hassanzadeh, Christian Hundt, Karthik Kashinath, Zongyi Li, Morteza Mardani, Jaideep Pathak, Mike Pritchard, David Pruitt, Sanjeev Raja, Shashank Subramanian.

References

[1] Bonev B., Kurth T., Hundt C., Pathak J., Baust M., Kashinath K., Anandkumar A.; Spherical Fourier Neural Operators: Learning Stable Dynamics on the Sphere; arXiv:2306.03838, 2023.

[2] Pathak J., Subramanian S., Harrington P., Raja S., Chattopadhyay A., Mardani M., Kurth T., Hall D., Li Z., Azizzadenesheli K., Hassanzadeh P., Kashinath K., Anandkumar A.; FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators; arXiv:2202.11214, 2022.

Citation

If you use this package, please cite

@InProceedings{bonev2023sfno,
    title={Spherical {F}ourier Neural Operators: Learning Stable Dynamics on the Sphere},
    author={Bonev, Boris and Kurth, Thorsten and Hundt, Christian and Pathak, Jaideep and Baust, Maximilian and Kashinath, Karthik and Anandkumar, Anima},
    booktitle={Proceedings of the 40th International Conference on Machine Learning},
    pages={2806--2823},
    year={2023},
    volume={202},
    series={Proceedings of Machine Learning Research},
    month={23--29 Jul},
    publisher={PMLR},
}

@article{pathak2022fourcastnet,
    title={Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators},
    author={Pathak, Jaideep and Subramanian, Shashank and Harrington, Peter and Raja, Sanjeev and Chattopadhyay, Ashesh and Mardani, Morteza and Kurth, Thorsten and Hall, David and Li, Zongyi and Azizzadenesheli, Kamyar and Hassanzadeh, Pedram and Kashinath, Karthik and Anandkumar, Animashree},
    journal={arXiv preprint arXiv:2202.11214},
    year={2022}
}
