mlcommons / hpc
Reference implementations of MLPerf™ HPC training benchmarks
Home Page: https://mlcommons.org/en/groups/training-hpc/
License: Apache License 2.0
Everything proceeds fine during the install until I reach this step:
$ conda env create -f env.yml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
- pytorch=1.8.1
- pymatgen=2020.12.31
Remove the submodule link and just put the code in directly.
Dear Sir/Madam,
Can we please update the repo to v1.0:
https://github.com/azrael417/mlperf-deepcam/tree/mlperf-hpc-v1.0
I think it would be best not to use sub-repos, but instead to pull the code and push it directly into the mlcommons repo. Please let me know if you need my help.
Best regards,
Thorsten
The links pointing to the rules and HPC rules delta documents are both broken.
Can we add the HPC training paper reference to the readme? Cody Coleman created a great example for training here: mlcommons/policies#86
Can we get tables showing what is in the benchmark suite in the landing page? See https://github.com/mlcommons/inference for a great example with all the different rounds.
It's probably not a common use-case, but the "dummy" wireup method for deepcam doesn't seem to work.
Here's an example script at NERSC:
#!/bin/bash
#SBATCH -A nstaff_g
#SBATCH -q early_science
#SBATCH -C gpu
#SBATCH -J mlperf-deepcam
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node 1
#SBATCH --cpus-per-task=32
#SBATCH --time 30
#SBATCH --image sfarrell/deepcam:ref-21.12
# Configuration
local_batch_size=2
batchnorm_group_size=1
data_dir="/global/cfs/cdirs/mpccc/gsharing/sfarrell/climate-data/All-Hist"
output_dir="$SCRATCH/deepcam/results"
run_tag="test_dummy_${SLURM_JOB_ID}"
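# Launch a single-process training run using the dummy wireup method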
srun --mpi=pmi2 shifter --module gpu \
python ./train.py \
--wireup_method "dummy" \
--run_tag ${run_tag} \
--data_dir_prefix ${data_dir} \
--output_dir ${output_dir} \
--model_prefix "segmentation" \
--optimizer "LAMB" \
--start_lr 0.0055 \
--lr_schedule type="multistep",milestones="800",decay_rate="0.1" \
--lr_warmup_steps 400 \
--lr_warmup_factor 1. \
--weight_decay 1e-2 \
--logging_frequency 10 \
--save_frequency 0 \
--max_epochs 1 \
--max_inter_threads 4 \
--seed $(date +%s) \
--batchnorm_group_size ${batchnorm_group_size} \
--local_batch_size ${local_batch_size}
This gives a runtime error when constructing the DDP wrapper:
Traceback (most recent call last):
File "./train.py", line 256, in <module>
main(pargs)
File "./train.py", line 167, in main
ddp_net = DDP(net, device_ids=[device.index],
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in __init__
self.process_group = _get_default_group()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 412, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
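A possible fix, assuming the dummy wireup is meant to support single-process runs, is to initialize a one-rank default process group before the DDP wrapper is constructed. The sketch below is illustrative only; the function name and where it gets called are hypothetical, not the reference code's API.

import os
import torch.distributed as dist

def init_dummy_process_group():
    # DDP requires an initialized default process group even for a single
    # process, so set up a one-rank gloo group with a local rendezvous address.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)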
Weight decay and L2 regularization differ by a factor of 2 (references below).
I think the weight decay value output at the following line should be "l2 * 2".
hpc/cosmoflow/models/cosmoflow.py, line 52 (commit b796e7a)
Boris's blog, "weight decay vs L2 regularization"
https://bbabenko.github.io/weight-decay/
keras-team/keras#2717 (comment): "Is it the same adding weight decay to all the layers (including input and output layer) than adding the weight decay term to the cost function?"
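For context, the factor of two comes from differentiating the L2 penalty; equating the resulting gradient term with the weight decay term gives (a standard derivation, not taken from the reference code):

L_{\mathrm{reg}}(w) = L(w) + \ell_2 \lVert w \rVert_2^2
\quad\Rightarrow\quad
\nabla_w \bigl( \ell_2 \lVert w \rVert_2^2 \bigr) = 2\,\ell_2\, w
\quad\Rightarrow\quad
\lambda_{\mathrm{wd}} = 2\,\ell_2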
The deepcam readme still describes the dependency on an external package for the LR warmup scheduler:
https://github.com/mlcommons/hpc/blob/main/deepcam/README.md#before-you-run
My understanding is that this is no longer needed because the code was updated to include its own implementation. @azrael417 can you confirm and update the readme?
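For illustration, here is a minimal linear-warmup scheduler in the spirit of what the code now ships; the class name, arguments, and defaults are hypothetical, not the reference implementation.

from torch.optim.lr_scheduler import _LRScheduler

class LinearWarmupLR(_LRScheduler):
    """Scale each base LR from warmup_factor up to 1.0 over warmup_steps steps."""
    def __init__(self, optimizer, warmup_steps, warmup_factor=1.0, last_epoch=-1):
        self.warmup_steps = warmup_steps
        self.warmup_factor = warmup_factor
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        if self.last_epoch >= self.warmup_steps:
            return list(self.base_lrs)
        alpha = self.last_epoch / max(1, self.warmup_steps)
        scale = self.warmup_factor * (1.0 - alpha) + alpha
        return [lr * scale for lr in self.base_lrs]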
@sparticlesteve
It seems that there is no seed logging in the CosmoFlow reference code. Could you consider adding seed logging?
Details coming soon.
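Regarding the seed logging request above: other MLPerf references log the seed through the mlperf_logging package, so something like the following minimal sketch could work (the surrounding variable names are assumptions):

from mlperf_logging import mllog

mllogger = mllog.get_mllogger()
seed = 42  # whichever seed the benchmark run actually uses
mllogger.event(key=mllog.constants.SEED, value=seed)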
Steve, could you please send me or check in a YAML file for the small example? For some reason, when I try to modify the parameters to use the small dataset, it doesn't work properly.
Thanks
https://mlcommons.org/en/credits/ needs to be updated.
Can we fill in https://docs.google.com/spreadsheets/d/17q_OtvVI_C5ET0shUTObHwiplUVRj5gFkU1dTkU7AcQ/edit#gid=0 for the three HPC benchmarks?
Our benchmark documentation could use some improvement. An initial step would be to harmonize the readme structure across the benchmarks; then we can work on fleshing out the content.
The CPU Dockerfile is too old compared to the GPU one: https://github.com/mlcommons/hpc/blob/main/cosmoflow/builds/Dockerfile.cpu_mpich
Hi! I'm currently fiddling around with the deepcam benchmark and plan on using NVIDIA DALI on the deepcam dataset. Right now, I'm using this script to convert the .h5 files to .npy, as DALI does not currently support HDF5 (see NVIDIA/DALI#1252).
Are there any better/more efficient ways to deal with this? Thank you in advance.
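In case it helps others, here is a minimal conversion sketch. This is not the script from the issue above; the dataset keys "climate/data" and "climate/labels" are assumptions about the DeepCAM HDF5 layout, so check them with h5ls first.

import glob
import os
import h5py
import numpy as np

def convert_h5_to_npy(h5_dir, out_dir):
    # Convert every HDF5 sample file into a data/label pair of .npy files
    # that DALI's numpy reader can consume.
    os.makedirs(out_dir, exist_ok=True)
    for path in sorted(glob.glob(os.path.join(h5_dir, "*.h5"))):
        stem = os.path.splitext(os.path.basename(path))[0]
        with h5py.File(path, "r") as f:
            data = f["climate/data"][...]      # assumed key
            labels = f["climate/labels"][...]  # assumed key
        np.save(os.path.join(out_dir, stem + "_data.npy"), data)
        np.save(os.path.join(out_dir, stem + "_labels.npy"), labels)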
The top-level readme has outdated information. We should update this and follow the template at https://github.com/mlcommons/training