aws-neuron / aws-neuron-parallelcluster-samples
While reproducing the Llama2 training in neuronx-nemo-megatron-llamav2-job.md, I hit a "link not found" issue for the dataset. I then found that the RedPajama dataset can be downloaded from the Hugging Face datasets site.
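As a workaround, here is a minimal sketch of pulling the dataset with the Hugging Face datasets library (the dataset id togethercomputer/RedPajama-Data-1T-Sample and the output path are my assumptions; depending on your datasets version you may also need trust_remote_code=True):
python3 -c "from datasets import load_dataset; \
ds = load_dataset('togethercomputer/RedPajama-Data-1T-Sample', split='train'); \
ds.to_json('redpajama_sample.jsonl')"   # dump to JSONL for later preprocessing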
I am using a single node to launch a pretraining job, following https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md
For the step of running the command to build the Megatron helper module:
cd ~
python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \
compile_helper()"
it has already been running for 15 minutes after displaying the following output:
2023-Oct-03 04:58:41.0157 23005:23005 ERROR TDRV:tdrv_get_dev_info No neuron device available
[NeMo W 2023-10-03 04:58:46 optimizers:70] Could not import distributed_fused_adam optimizer from Apex
[NeMo W 2023-10-03 04:58:49 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-10-03 04:58:50 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
make: Entering directory '/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.8 -I/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pybind11/include helpers.cpp -o helpers.cpython-38-x86_64-linux-gnu.so
I am wondering if something is wrong.
Updated: after 40 minutes there was still no further output, so I terminated the process.
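One thing worth checking in this situation is whether the shared object from the make step above was actually produced (the path is taken from the g++ line in the output; adjust the venv and Python version to your environment):
ls -l /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron/helpers.cpython-38-x86_64-linux-gnu.so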
The README for creating the ParallelCluster does not specify that the bucket name neuron-s3 needs to be replaced with a customer bucket name. The default neuron-s3 bucket does not work due to Access Denied. The Custom Script Update section does explain how to change the bucket name and use a custom script, but it does not indicate that this is required.
aws s3 ls s3://neuron-s3/pcluster/post-install-scripts/neuron-installation/v2.13.0/u20/pt/install_neuron.sh
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
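A minimal sketch of the workaround (the bucket name is a placeholder, and you must first obtain install_neuron.sh from a source you can access):
# copy the post-install script into a bucket you own, then reference that URI
# as the custom post-install script in the cluster configuration instead of s3://neuron-s3/...
aws s3 cp ./install_neuron.sh \
    s3://<your-bucket>/pcluster/post-install-scripts/neuron-installation/v2.13.0/u20/pt/install_neuron.sh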
This document shows that the best performance for Llama2-7B pretraining is achieved with TP=8, DP=64, and PP=1. But how can I set the DP?
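For what it is worth, my understanding (not stated in the document) is that DP is not set directly; it falls out of the world size and the other parallel dimensions as DP = (nodes * workers_per_node) / (TP * PP):
# e.g. TP=8, PP=1 on 16 trn1.32xlarge nodes with 32 workers each:
echo $(( (16 * 32) / (8 * 1) ))   # prints 64 data-parallel replicas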
If you are following the instructions at
https://github.com/aws-neuron/aws-neuron-parallelcluster-samples#train-a-model-on-aws-trn1-parallelcluster
nothing mentions that you need to install awscli.
You get an error:
{
"message": "Unable to locate credentials"
}
However, the instructions at https://github.com/aws/aws-parallelcluster specifically say that you need to run
$ pip3 install awscli
and
aws configure
Maybe you want to add the instructions to https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html or to the link above in this repository.
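A quick way to confirm that the CLI and credentials are in place after those two steps (just a common sanity check, not part of the original instructions):
pip3 install awscli
aws configure                  # enter access key, secret key, default region
aws sts get-caller-identity    # should print your account/ARN instead of "Unable to locate credentials"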
Could you please give us the full training code for Llama2?
I was able to clone this repo to my local environment. However, I cannot seem to push it:
git push --set-upstream origin initial_version
remote: Write access to repository not granted.
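The usual workaround when you do not have write access is to push to a fork instead (a sketch; <your-user> is a placeholder for a GitHub account that has forked this repository):
git remote add myfork https://github.com/<your-user>/aws-neuron-parallelcluster-samples.git
git push --set-upstream myfork initial_version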
This GPT-3 23B pretraining tutorial crashes after 12-18 hours of pretraining, specifically on the Ubuntu 22.04 stack. It works on the Ubuntu 20.04 stack.
Not all processes in the cluster crash, but one or more processes in the 4-node cluster crash with the same error.
The key error stack trace is as follows:
2024-Jan-11 19:01:15.502423 18506:21252 ERROR ENC:ncclNetRegMr failed neuronNetRegMr request to NCCL
2024-Jan-11 19:01:15.502435 18506:21252 ERROR ENC:configure_net_connector [rank 0, channel 0] failed to register channel buffer, addr: 0x0x7fd881000000, len: 50331648
2024-Jan-11 19:01:15.503417 18506:21252 ERROR ENC:alg_ring_init [nec_dev 13] failed to configure network connector for RING
2024-Jan-11 19:01:15.499989 18506:21240 ERROR ENC:ncclNetRegMr failed neuronNetRegMr request to NCCL
2024-Jan-11 19:01:15.503432 18506:21240 ERROR ENC:configure_net_connector [rank 0, channel 0] failed to register channel buffer, addr: 0x0x7fd895000000, len: 50331648
2024-Jan-11 19:01:15.501323 18506:21247 ERROR ENC:ncclNetRegMr failed neuronNetRegMr request to NCCL
2024-Jan-11 19:01:15.503428 18506:21252 ERROR ENC:init_ring_algorithm [nec_dev 13] failed to alg_ring_init for RING
2024-Jan-11 19:01:15.505482 18506:21240 ERROR ENC:alg_ring_init [nec_dev 1] failed to configure network connector for RING
2024-Jan-11 19:01:15.505492 18506:21240 ERROR ENC:init_ring_algorithm [nec_dev 1] failed to alg_ring_init for RING
2024-Jan-11 19:01:15.505497 18506:21240 ERROR ENC:enc_init_comm [rank 0] failed to init ring algorithm
2024-Jan-11 19:01:15.505503 18506:21240 ERROR ENC:enc_init_replica_groups [nec_dev 1] failed to init ENC comm
2024-Jan-11 19:01:15.505509 18506:21240 ERROR ENC:enc_load_operations [nec_dev 1] failed to init replica groups
2024-Jan-11 19:01:15.505514 18506:21240 ERROR TDRV:v2_cc_execute [nec_dev 1] failed to load operations
2024-Jan-11 19:01:15.505519 18506:21240 ERROR NMGR:dlr_infer Failed to prep collectives execution, err: 1
2024-Jan-11 19:01:15.505550 18506:21240 ERROR NMGR:kmgr_async_exec_default_exec_status_callbackExec id 0 for model 10017 on worker 1 failed with fatal status 1... aborting.
python3: /local/p4clients/pkgbuild-Gx12v/workspace/src/KaenaRuntime/kmgr/kmgr_async_exec.cc:27: void kmgr_async_exec_default_exec_status_callback(void*, uint32_t, uint32_t, uint64_t, NRT_STATUS): Assertion `0' failed.
2024-Jan-11 19:01:15.504471 18506:21247 ERROR ENC:configure_net_connector [rank 0, channel 0] failed to register channel buffer, addr: 0x0x7fd88d000000, len: 50331648
2024-Jan-11 19:01:15.508072 18506:21247 ERROR ENC:alg_ring_init [nec_dev 8] failed to configure network connector for RING
2024-Jan-11 19:01:15.505482 18506:21252 ERROR ENC:enc_init_comm [rank 0] failed to init ring algorithm
2024-Jan-11 19:01:15.508081 18506:21247 ERROR ENC:init_ring_algorithm [nec_dev 8] failed to alg_ring_init for RING
Kernel (uname -a):
Linux ip-172-31-73-214 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python packages (pip excerpt):
apex @ file:///home/ubuntu/neuronx-nemo-megatron/build/apex-0.1-py3-none-any.whl#sha256=882cc65b94adc92e20864e468d82f072395571a54155472d77f1961b846cd9b2
aws-neuronx-runtime-discovery==2.9
libneuronxla==0.5.669
nemo_toolkit @ file:///home/ubuntu/neuronx-nemo-megatron/build/nemo_toolkit-1.14.0-py3-none-any.whl#sha256=dad4a2ecf0d65d03eb481542cffaaabe58c9960b25e3c58725cd7a0aad516cef
neuronx-cc==2.12.54.0+f631c2365
neuronx-hwm==2.12.0.0+422c9037c
torch-neuronx==1.13.1.1.13.0
torch-xla==1.13.1+torchneurond
Neuron system packages (dpkg):
aws-neuronx-collectives 2.19.7.0-530fb3064 amd64 neuron_ccom built using CMake
aws-neuronx-dkms 2.15.9.0 amd64 aws-neuronx driver in DKMS format.
aws-neuronx-oci-hook 2.2.45.0 amd64 neuron_oci_hook built using CMake
aws-neuronx-runtime-lib 2.19.5.0-97e2d271b amd64 neuron_runtime built using CMake
aws-neuronx-tools 2.16.1.0 amd64 Neuron profile and debug tools
Launcher: OpenMPI
The head node type is trn1.2xlarge and was created using this CFN template, with EFS and FSx file-systems enabled. The cluster nodes were of type trn1.32xlarge and were created using this CFN template.
This is the launch script, after running neuron_parallel_compile:
#!/bin/bash
set -o pipefail
[[ $# -ne 1 ]] && echo "usage: $0 script" && exit 1
SCRIPT=$1
echo "Training script: $SCRIPT"
[[ -z $MASTER_ADDR ]] && echo "MASTER_ADDR is not set" && exit 1
[[ -z $HOSTFILE ]] && echo "HOSTFILE is not set" && exit 1
NUM_PARALLEL=4
JOB_ID="neuron_nemo_megatron_gpt_23b"
export LOGS_DIR="$HOME/fsx/neuronx_logs/$JOB_ID"
mkdir -p $LOGS_DIR
export CACHE_DIR="$HOME/fsx/neuronx_cache/$JOB_ID"
mkdir -p $CACHE_DIR
export XDG_CACHE_HOME="$HOME/efs/.cache/$JOB_ID"
mkdir -p $XDG_CACHE_HOME
export DATA_PATH="$HOME/efs/examples_datasets/gpt2"
[[ ! -d $DATA_PATH ]] && echo "$DATA_PATH not found" && exit 1
export WORK_DIR=$HOME/efs/git/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
export PATH='/opt/aws/neuron/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin'
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/aws/neuron/lib"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/efa/lib"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/efa/lib64"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/openmpi/lib64"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
mpirun -np $NUM_PARALLEL --verbose \
--hostfile $HOSTFILE \
-bind-to none -map-by slot \
--mca plm_rsh_no_tree_spawn 1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 \
--mca hwloc_base_binding_policy none --mca rmaps_base_mapping_policy slot \
--mca orte_keep_fqdn_hostnames t \
--report-child-jobs-separately \
--display-map --tag-output --timestamp-output \
-wdir $WORK_DIR \
-x PATH \
-x LD_LIBRARY_PATH \
-x PYTHONUNBUFFERED=1 \
-x PYTHONIOENCODING=UTF-8 \
-x LANG=C.UTF-8 \
-x LC_ALL=C.UTF-8 \
-x MASTER_ADDR \
-x DATA_PATH \
-x CACHE_DIR \
-x LOGS_DIR \
-x WORK_DIR \
-x SCRIPT \
-x XDG_CACHE_HOME \
-x TOKENIZERS_PARALLELISM=false \
bash -c "source /home/ubuntu/aws_neuron_nemo_megatron/bin/activate && \
./$SCRIPT"
This is the test.sh script:
#!/usr/bin/env bash
source ./train_setup.sh
: ${SEQ_LENGTH:=2048}              # sequence length
: ${HS:=4096}                      # hidden size
: ${TP:=8}                         # tensor model parallel size
: ${PP:=1}                         # pipeline model parallel size
: ${N_LAYERS:=32}                  # number of transformer layers
: ${N_AH:=32}                      # number of attention heads
: ${UBS:=1}                        # micro batch size
: ${ACT_CHKPNT_GRANULARITY:=full}  # activation checkpointing granularity
: ${GBS_MULTIPLE:=32}              # per-node multiplier for the global batch size
GBS=$((NTASKS*GBS_MULTIPLE))       # global batch size scales with the node count
: ${TRAIN_ITERS:=300000}           # total training steps
FFN_HS=$(($HS*4))                  # feed-forward hidden size = 4x hidden size
echo "SEQ_LEN=$SEQ_LENGTH, HS=$HS, FFN_HS=$FFN_HS TP=$TP PP=$PP N_LAYERS=$N_LAYERS N_AH=$N_AH GBS=$GBS UBS=$UBS TRAIN_ITERS=$TRAIN_ITERS"
$MAYBE_COMPILE torchrun $DISTRIBUTED_ARGS megatron_gpt_pretraining.py \
--config-path=conf \
--config-name=megatron_gpt_config \
trainer.devices=$PROCESSES_PER_NODE \
trainer.num_nodes=$NTASKS \
trainer.max_epochs=null \
trainer.max_steps=$TRAIN_ITERS \
trainer.val_check_interval=$(($TRAIN_ITERS+1)) \
trainer.log_every_n_steps=1 \
trainer.limit_val_batches=1 \
trainer.limit_test_batches=1 \
trainer.accumulate_grad_batches=1 \
trainer.precision=32 \
model.megatron_amp_O2=$megatron_amp_O2 \
model.micro_batch_size=$UBS \
model.global_batch_size=$GBS \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.max_position_embeddings=$SEQ_LENGTH \
model.encoder_seq_length=$SEQ_LENGTH \
model.hidden_size=$HS \
model.ffn_hidden_size=$FFN_HS \
model.num_layers=$N_LAYERS \
model.num_attention_heads=$N_AH \
model.init_method_std=0.021 \
model.hidden_dropout=0.1 \
model.layernorm_epsilon=1e-5 \
model.tokenizer.vocab_file=$DATA_PATH/gpt2-vocab.json \
model.tokenizer.merge_file=$DATA_PATH/gpt2-merges.txt \
model.data.data_prefix=[1.0,$DATA_PATH/my-gpt2_text_document] \
model.data.num_workers=1 \
model.data.seq_length=$SEQ_LENGTH \
model.optim.name=$OPTIM_NAME \
model.optim.capturable=True \
model.optim.lr=0.00015 \
model.optim.betas=[0.9,0.95] \
model.optim.weight_decay=0.01 \
model.optim.sched.name=CosineAnnealing \
model.optim.sched.warmup_steps=750 \
model.optim.sched.constant_steps=80000 \
model.optim.sched.min_lr=1.0e-5 \
model.sequence_parallel=True \
model.activations_checkpoint_granularity=$ACT_CHKPNT_GRANULARITY \
model.activations_checkpoint_method=uniform \
model.activations_checkpoint_num_layers=1 \
+model.save_xser=True \
exp_manager.create_tensorboard_logger=$CREATE_TB_LOGGER \
exp_manager.resume_if_exists=False \
exp_manager.resume_ignore_no_checkpoint=False \
exp_manager.create_checkpoint_callback=$CHECKPOINT_CALLBACK \
exp_manager.explicit_log_dir=$EXPLICIT_LOGDIR \
+exp_manager.checkpoint_callback_params.train_time_interval=3600 \
model.use_cpu_initialization=True 2>&1 | tee -a $LOG_PATH/log
exit 0
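For reference, with the defaults above on the 4-node cluster described in this issue (NTASKS is one MPI rank per node), the derived values work out as:
echo $(( 4 * 32 ))     # GBS    = NTASKS * GBS_MULTIPLE = 128
echo $(( 4096 * 4 ))   # FFN_HS = HS * 4                = 16384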
This is the train_setup.sh script:
#!/usr/bin/env bash
set -o pipefail
set -e
ulimit -n 65535
export FI_EFA_USE_DEVICE_RDMA=1
export FI_PROVIDER=efa
export FI_EFA_FORK_SAFE=1
if [ -v SLURM_NNODES ]
then
# SLURM runs
sudo sysctl -w net.ipv4.ip_local_reserved_ports=41000
IPS=""
for h in $(scontrol show hostname); do
IPS="$IPS $(nslookup $h | awk '/^Address: / { print $2 }')";
done
HOSTS=(${IPS//\ / })
NODEID=$SLURM_NODEID
NTASKS=$SLURM_NTASKS
export MASTER_ADDR=${HOSTS[0]}
export NEMO_EXPM_VERSION=$SLURM_JOB_ID
export EXPLICIT_LOGDIR=null
: ${SLURM_RESTART_COUNT:=0}
LOG_PATH=logs/$SLURM_JOB_ID/$SLURM_RESTART_COUNT/$NODEID/
mkdir -p $LOG_PATH
export NEURON_COMPILE_CACHE_URL="$HOME/neuron_cache" # Place cache on shared storage to reduce redundant compilations
# Make sure to install latest runtime
./setup.sh 2>&1 | tee $LOG_PATH/setup.log
elif [ -v OMPI_COMM_WORLD_RANK ]
then
# MPI
[[ -z $MASTER_ADDR ]] && echo "MASTER_ADDR is not set" && exit 1
TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
PRIMARY_MAC=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s http://169.254.169.254/latest/meta-data/mac)
export CCOM_SOCKET_IFNAME=$(ip -o link show | grep -F "link/ether $PRIMARY_MAC" | awk -F'[ :]+' '{print $2}')
NODEID=$OMPI_COMM_WORLD_RANK
NTASKS=$OMPI_COMM_WORLD_SIZE
export EXPLICIT_LOGDIR=$LOGS_DIR
LOG_PATH=$LOGS_DIR/$NODEID/
mkdir -p $LOG_PATH
export NEURON_COMPILE_CACHE_URL=$CACHE_DIR/$NODEID # Place cache on shared storage to reduce redundant compilations
else
# Single-node, non-SLURM, non-MPI runs
HOSTS=(localhost)
NODEID=0
NTASKS=1
export MASTER_ADDR=${HOSTS[0]}
export NEMO_EXPM_VERSION=$(date "+%Y-%m-%d_%H-%M-%S")
export EXPLICIT_LOGDIR=null
LOG_PATH=./nemo_experiments/logs
mkdir -p $LOG_PATH
fi
export HYDRA_FULL_ERROR=1
export PROCESSES_PER_NODE=32
export MASTER_PORT=41000
export NEURON_RT_EXEC_TIMEOUT=10
export DISTRIBUTED_ARGS="--nproc_per_node $PROCESSES_PER_NODE --nnodes $NTASKS --node_rank $NODEID --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS
export BUCKET_CAP_MB=1024
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=5
export NEURON_TRANSFER_WITH_STATIC_RING_OPS=""
export MALLOC_ARENA_MAX=128
export TF_NUM_INTEROP_THREADS=1024
export XLA_THREAD_POOL_SIZE=4
export XLA_IO_THREAD_POOL_SIZE=4
export NEURON_RT_STOCHASTIC_ROUNDING_EN=1
#training_precision is one of 'bf16SR', 'megatron_amp_O2', 'fp32_OptStates'
#training_precision = "bf16SR", uses BF16 + Stochastic Rounding
#training_precision = "megatron_amp_O2", master weights and optimizer states are stored in fp32, model weights in bf16
#training_precision = "fp32_OptStates", optimizer states are stored in fp32, model weights in bf16
training_precision="bf16SR"
if [[ $training_precision == "bf16SR" ]];then
echo using BF16 SR
export XLA_USE_BF16=1
export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
export OPTIM_NAME=adamw
export megatron_amp_O2=false
elif [[ $training_precision == "megatron_amp_O2" ]]; then
echo using megatron_amp_O2
export XLA_DOWNCAST_BF16=1
export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
export OPTIM_NAME=adamw
export megatron_amp_O2=true
elif [[ $training_precision == "fp32_OptStates" ]]; then
echo using FP32 Optimizer States
export XLA_DOWNCAST_BF16=1
export NEURON_CC_FLAGS="--model-type transformer --distribution-strategy=nemo --enable-mixed-precision-accumulation"
export OPTIM_NAME=adamw_fp32OptState
export megatron_amp_O2=false
else
echo Incorrect Training Precision Provided
fi
export CREATE_TB_LOGGER=True
export CHECKPOINT_CALLBACK=True
if [ "$COMPILE" = "1" ]; then
echo "compiling only run"
MAYBE_COMPILE="neuron_parallel_compile"
export TRAIN_ITERS=3
CREATE_TB_LOGGER=False
CHECKPOINT_CALLBACK=False
export MASTER_PORT=41001
fi
The EFS file-system is mounted under ~/efs and the FSx for Lustre file-system is mounted under ~/fsx.
The ~/.ssh/config on the nodes is set as follows:
Host *
StrictHostKeyChecking no
The environment was then set up as follows:
source ~/aws_neuron_nemo_megatron/bin/activate
sudo mkdir -p ~/efs/git; sudo chown -R ubuntu:ubuntu ~/efs/git
sudo mkdir -p ~/efs/examples_datasets/gpt2/; sudo chown -R ubuntu:ubuntu ~/efs/examples_datasets/gpt2/
The preprocessed GPT2 dataset files were placed under ~/efs/examples_datasets/gpt2/.
cd ~/efs/git; git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git
cd ~/efs/git/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling
The hostfile contains one line with slots=1 for each of the four nodes. Set the path to the hostfile in the environment variable HOSTFILE and the IP address of one of the cluster nodes in the environment variable MASTER_ADDR, then launch:
./pretrain_openmpi.sh gpt_23b.sh 1>/tmp/a.out 2>&1 &
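For reference, a minimal sketch of the hostfile and environment setup described above (the IP addresses are placeholders for the four compute nodes):
cat > ~/hostfile <<EOF
10.1.1.11 slots=1
10.1.1.12 slots=1
10.1.1.13 slots=1
10.1.1.14 slots=1
EOF
export HOSTFILE=~/hostfile
export MASTER_ADDR=10.1.1.11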