mlcommons / hpc
Reference implementations of MLPerf™ HPC training benchmarks
Home Page: https://mlcommons.org/en/groups/training-hpc/
License: Apache License 2.0
Everything proceeds fine during the install until I reach this step:
$ conda env create -f env.yml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
- pytorch=1.8.1
- pymatgen=2020.12.31
Remove the submodule link and just put the code in directly.
Dear Sir/Madam,
Can we please update the repo to v1.0:
https://github.com/azrael417/mlperf-deepcam/tree/mlperf-hpc-v1.0
I think it would be best not to use sub-repos, but instead to pull the code and push it directly into the mlcommons repo. Please let me know if you need my help.
Best regards,
Thorsten
The links pointing to the rules and HPC rules delta documents are both broken.
Can we add the HPC training paper reference to the readme? Cody Coleman created a great example for training here: mlcommons/policies#86
Can we get tables showing what is in the benchmark suite in the landing page? See https://github.com/mlcommons/inference for a great example with all the different rounds.
It's probably not a common use-case, but the "dummy" wireup method for deepcam doesn't seem to work.
Here's an example script at NERSC:
#!/bin/bash
#SBATCH -A nstaff_g
#SBATCH -q early_science
#SBATCH -C gpu
#SBATCH -J mlperf-deepcam
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node 1
#SBATCH --cpus-per-task=32
#SBATCH --time 30
#SBATCH --image sfarrell/deepcam:ref-21.12
# Configuration
local_batch_size=2
batchnorm_group_size=1
data_dir="/global/cfs/cdirs/mpccc/gsharing/sfarrell/climate-data/All-Hist"
output_dir="$SCRATCH/deepcam/results"
run_tag="test_dummy_${SLURM_JOB_ID}"
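# Launch a single-process training run using the dummy wireup method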
srun --mpi=pmi2 shifter --module gpu \
python ./train.py \
--wireup_method "dummy" \
--run_tag ${run_tag} \
--data_dir_prefix ${data_dir} \
--output_dir ${output_dir} \
--model_prefix "segmentation" \
--optimizer "LAMB" \
--start_lr 0.0055 \
--lr_schedule type="multistep",milestones="800",decay_rate="0.1" \
--lr_warmup_steps 400 \
--lr_warmup_factor 1. \
--weight_decay 1e-2 \
--logging_frequency 10 \
--save_frequency 0 \
--max_epochs 1 \
--max_inter_threads 4 \
--seed $(date +%s) \
--batchnorm_group_size ${batchnorm_group_size} \
--local_batch_size ${local_batch_size}
This gives a runtime error when constructing the DDP wrapper:
Traceback (most recent call last):
File "./train.py", line 256, in <module>
main(pargs)
File "./train.py", line 167, in main
ddp_net = DDP(net, device_ids=[device.index],
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in __init__
self.process_group = _get_default_group()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 412, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
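A possible fix, assuming the dummy wireup is meant to support single-process runs, is to initialize a one-rank default process group before the DDP wrapper is constructed. The sketch below is illustrative only; the function name and where it gets called are hypothetical, not the reference code's API.

import os
import torch.distributed as dist

def init_dummy_process_group():
    # DDP requires an initialized default process group even for a single
    # process, so set up a one-rank gloo group with a local rendezvous address.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)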
Weight decay and L2 regularization differ by a factor of 2 (references below).
I think the weight decay value output at the following line should be "l2 * 2".
hpc/cosmoflow/models/cosmoflow.py, line 52 (commit b796e7a)
Boris's blog, "weight decay vs L2 regularization"
https://bbabenko.github.io/weight-decay/
keras-team/keras#2717 (comment): "Is it the same adding weight decay to all the layers (including input and output layer) than adding the weight decay term to the cost function?"
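For context, the factor of two comes from differentiating the L2 penalty; equating the resulting gradient term with the weight decay term gives (a standard derivation, not taken from the reference code):

L_{\mathrm{reg}}(w) = L(w) + \ell_2 \lVert w \rVert_2^2
\quad\Rightarrow\quad
\nabla_w \bigl( \ell_2 \lVert w \rVert_2^2 \bigr) = 2\,\ell_2\, w
\quad\Rightarrow\quad
\lambda_{\mathrm{wd}} = 2\,\ell_2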
The deepcam readme still describes the dependency on an external package for the LR warmup scheduler:
https://github.com/mlcommons/hpc/blob/main/deepcam/README.md#before-you-run
My understanding is that this is no longer needed because the code was updated to include its own implementation. @azrael417 can you confirm and update the readme?
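For illustration, here is a minimal linear-warmup scheduler in the spirit of what the code now ships; the class name, arguments, and defaults are hypothetical, not the reference implementation.

from torch.optim.lr_scheduler import _LRScheduler

class LinearWarmupLR(_LRScheduler):
    """Scale each base LR from warmup_factor up to 1.0 over warmup_steps steps."""
    def __init__(self, optimizer, warmup_steps, warmup_factor=1.0, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.warmup_factor = warmup_factor
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        if self.last_epoch >= self.warmup_steps:
            return list(self.base_lrs)
        alpha = self.last_epoch / max(1, self.warmup_steps)
        scale = self.warmup_factor * (1.0 - alpha) + alpha
        return [lr * scale for lr in self.base_lrs]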
@sparticlesteve
It seems that there is no seed logging in the CosmoFlow reference code. Could you consider adding seed logging?
Details coming soon.
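Regarding the seed logging request above: other MLPerf references log the seed through the mlperf_logging package, so something like the following minimal sketch could work (the surrounding variable names are assumptions):

from mlperf_logging import mllog

mllogger = mllog.get_mllogger()
seed = 42  # whichever seed the benchmark run actually uses
mllogger.event(key=mllog.constants.SEED, value=seed)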
Steve, could you please send me or check in a YAML file for the small example? For some reason, when I try to modify the parameters to use the small dataset, it doesn't work properly.
Thanks
https://mlcommons.org/en/credits/ needs to be updated.
Can we fill in https://docs.google.com/spreadsheets/d/17q_OtvVI_C5ET0shUTObHwiplUVRj5gFkU1dTkU7AcQ/edit#gid=0 for the three HPC benchmarks?
Our benchmark documentation could use some improvement. An initial step would be to harmonize the readme structure across the benchmarks; then we can work on fleshing out the content.
The CPU Dockerfile is too old compared to the GPU one: https://github.com/mlcommons/hpc/blob/main/cosmoflow/builds/Dockerfile.cpu_mpich
Hi! I'm currently fiddling around with the deepcam benchmark and plan on using NVIDIA DALI on the deepcam dataset. Right now, I'm using this script to convert the .h5 files to .npy, as DALI does not currently support HDF5 (see NVIDIA/DALI#1252).
Are there any better/more efficient ways to deal with this? Thank you in advance.
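In case it helps others, here is a minimal conversion sketch. This is not the script from the issue above; the dataset keys "climate/data" and "climate/labels" are assumptions about the DeepCAM HDF5 layout, so check them with h5ls first.

import glob
import os
import h5py
import numpy as np

def convert_h5_to_npy(h5_dir, out_dir):
    # Convert every HDF5 sample file into a data/label pair of .npy files
    # that DALI's numpy reader can consume.
    os.makedirs(out_dir, exist_ok=True)
    for path in sorted(glob.glob(os.path.join(h5_dir, "*.h5"))):
        stem = os.path.splitext(os.path.basename(path))[0]
        with h5py.File(path, "r") as f:
            data = f["climate/data"][...]      # assumed key
            labels = f["climate/labels"][...]  # assumed key
        np.save(os.path.join(out_dir, stem + "_data.npy"), data)
        np.save(os.path.join(out_dir, stem + "_labels.npy"), labels)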
The top-level readme has outdated information. We should update this and follow the template at https://github.com/mlcommons/training