
quickstart's Introduction

Turing AI Cloud Quick Start

Workflow Overview

[Workflow diagram]

The diagram above illustrates the submission and debugging workflows of a TACC job.

Creating a TACC account

Before using the tcloud SDK, please make sure that you have applied for a TACC account and submitted your public key to TACC. You can generate an SSH public key by following these steps. To apply for a TACC account, please visit our website.
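If you do not have a key pair yet, a typical way to generate one is with ssh-keygen (a sketch; the comment string is just an example, and the contents of the printed .pub file are what you submit to TACC):

    # Generate an RSA key pair; press Enter to accept the default path ~/.ssh/id_rsa
    $ ssh-keygen -t rsa -b 4096 -C "you@example.com"
    # This is the public key to submit to TACC:
    $ cat ~/.ssh/id_rsa.pub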

Installing tcloud SDK

  • Download tcloud SDK
    Download the latest tcloud SDK from the release tags.
  • Install tcloud SDK
    Place setup.sh and tcloud in the same directory, then run setup.sh, as sketched below.
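For example, assuming both files were downloaded to ~/tcloud-sdk (a hypothetical path):

    # Make the installer and the CLI executable, then run the installer
    $ cd ~/tcloud-sdk
    $ chmod +x setup.sh tcloud
    $ ./setup.sh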

Submitting Your First TACC Job

CLI Tool Initialization

  • First, you need to configure your TACC credentials. You can do this by running the tcloud config command:
    $ tcloud config [-u/--username] MYUSERNAME
    $ tcloud config [-f/--file] MYPRIVATEFILEPATH
    
  • Then, run the tcloud init command to obtain the latest cluster hardware information from the TACC cluster. (A complete first-time session is sketched after this list.)
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    tacc*        up   infinite      5  alloc 10-0-7-[18-19],10-0-8-[18-19]
    tacc*        up   infinite     19   idle 10-0-2-[18-19],10-0-3-[10-13]
    
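For reference, a first-time session combining both steps might look like this (the username and key path are placeholders):

    $ tcloud config -u alice           # your TACC username
    $ tcloud config -f ~/.ssh/id_rsa   # path to your private key file
    $ tcloud init                      # fetch cluster hardware information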

Download Sample Job

You can use this link to download our example code.

Submit a Job

Each job requires a main.py together with a tuxiv.conf file.
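Based on the helloworld example, a minimal job directory looks roughly like this (a sketch; the run.sh and configurations/ files seen in the submit log below appear to be generated from tuxiv.conf at submission time):

    helloworld/
    ├── main.py      # your program's entry point
    └── tuxiv.conf   # job resources, environment, and entrypoint configuration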

After tcloud is configured correctly, you can try to submit your first job.

  1. Go to the example folder in your terminal.
  2. Run the tcloud submit command.
    ~/Dow/quickstart-master/example/helloworld ❯ tcloud submit
    Start parsing tuxiv.conf...
    building file list ...
    8 files to consider
    helloworld/
    helloworld/run.sh
            151 100%    0.00kB/s    0:00:00 (xfer#1, to-check=5/8)
    helloworld/configurations/
    helloworld/configurations/citynet.sh
              12 100%   11.72kB/s    0:00:00 (xfer#2, to-check=2/8)
    helloworld/configurations/conda.yaml
            107 100%  104.49kB/s    0:00:00 (xfer#3, to-check=1/8)
    helloworld/configurations/run.slurm
            278 100%  271.48kB/s    0:00:00 (xfer#4, to-check=0/8)
    
    sent 429 bytes  received 144 bytes  382.00 bytes/sec
    total size is 1071  speedup is 1.87
    Submitted batch job 2000
    Job helloworld submitted.
    

Retrieve Your Job Status and Output

In this section, we provide two methods to monitor the job log.

After training completes, you can use tcloud ls [filepath] to locate the output files.

  • cat

You can configure your log path in tuxiv.conf. The default path is slurm_log/slurm-jobid.out.

    tcloud cat slurm_log/slurm-jobid.out
    

In the helloworld example, the tuxiv.conf file specifies the log path as slurm_log/hello.log.

  • download

    You can use tcloud download [filepath].

Note that you can only read and download files in USERDIR; the files in WORKDIR may be removed after the job finishes.

    tcloud download slurm_log/slurm-jobid.out
    
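Putting these together for the helloworld job above, whose tuxiv.conf sets the log path to slurm_log/hello.log:

    $ tcloud ls slurm_log                  # list output files in USERDIR
    $ tcloud cat slurm_log/hello.log       # print the job log
    $ tcloud download slurm_log/hello.log  # copy the log to your local machine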

Manage your environment

tcloud uses Conda to manage your dependencies; all dependencies are installed through conda. Please specify the conda channels required for your packages to resolve. tcloud offers two ways of managing environments:

  1. One-off Environment. A new environment with the given dependencies is created every time you submit a task to TACC. If you do not specify an environment name in tuxiv.conf and your dependency configuration does not change between two consecutive submissions, the previous environment is reused to save time. This is the default behavior.
    environment:
      # name:       # do not specify environment name
      dependencies:
          - pytorch=1.6.0
          - torchvision=0.7.0
      channels: pytorch
  2. Persistent Environment. You can create a dedicated environment for each project by setting a distinct environment name in tuxiv.conf. When you change the dependency configuration of an existing environment, tcloud will update that environment instead of creating a new one. Learn how in the environment section of the tuxiv.conf documentation.
    environment:
      name: torch-env   # dedicated environment name
      dependencies:
          - pytorch=1.6.0
          - torchvision=0.7.0
      channels: pytorch

Demo video

The following video will help you use the tcloud CLI to begin your TACC journey: demo video.

Examples

Basic examples are provided under the example folder: HelloWorld, TensorFlow, PyTorch, and MXNet.

FAQ

See the FAQ for common questions.

quickstart's People

Contributors

andyxukq, decsun, lalalapotter, xcwanandy


quickstart's Issues

Any limitation for GPU usage?

Hi! It seems TACC is a platform providing free GPU resources...? Are there any limitations, such as disk size or GPU running time? Thanks!

Publickey Update

Hi there, how do we check which public key is on file, and how do we update the public key used for logging into TACC?

How to delete files in USERDIR

Hi, I have uploaded some files to my USERDIR, but I do not know how to delete them with a tcloud command; there seems to be no command for that. Could you give me a hand? Thanks in advance.

ncclSystemError with multi-node multi-gpu training

Problem Description

I am trying to run multi-node distributed training with PyTorch. More specifically, I am using torchrun as the distributed launcher, together with DeepSpeed. The code works fine in the single-node, multi-GPU setting, but an NCCL error occurs when multiple nodes are used.

Launch Script

The launch script looks like this:

# tcloud_run.sh
GPU_PER_NODE=$1
# SLURM_NNODES and SLURM_PROCID are set by Slurm for each task;
# MASTER_ADDR and MASTER_PORT are exported by the job script (see the log below).
torchrun --nproc_per_node $GPU_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
    --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
    -m <my python script> <my args>

and in tuxiv.conf, the entrypoint is sh tcloud_run.sh 2.

Environment Setup

# cuda
export CUDA_HOME=/mnt/data/cuda/cuda-11.3.1
export LD_LIBRARY_PATH=/mnt/data/cuda/cuda-11.3.1/lib64:$LD_LIBRARY_PATH
export PATH=/mnt/data/cuda/cuda-11.3.1/bin:$PATH
# nccl
export NCCL_SOCKET_IFNAME=eth0

Debug Log

Attached below is the slurm_log with NCCL_DEBUG=INFO set.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Nodelist:=  10-0-4-[10-11]
Number of nodes:=  2
Ntasks per node:=  1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
MASTER_PORT=13819
WORLD_SIZE=2
MASTER_ADDR=10-0-4-10
NCCL_SOCKET_IFNAME=eth0
0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.4.10  netmask 255.255.255.0  broadcast 10.0.4.255
        ether 6e:7c:11:bc:44:4c  txqueuelen 1000  (Ethernet)
        RX packets 131917859378  bytes 195829243616837 (195.8 TB)
        RX errors 0  dropped 11298050  overruns 0  frame 0
        TX packets 133110220747  bytes 197254884153813 (197.2 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.11  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:0b  txqueuelen 0  (Ethernet)
        RX packets 3850508  bytes 3285467306 (3.2 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2899989  bytes 5647254656 (5.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 4775558139  bytes 188213232702545 (188.2 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4775558139  bytes 188213232702545 (188.2 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.4.11  netmask 255.255.255.0  broadcast 10.0.4.255
        ether 56:7f:10:19:19:af  txqueuelen 1000  (Ethernet)
        RX packets 132366442963  bytes 195861266566055 (195.8 TB)
        RX errors 0  dropped 7418680  overruns 0  frame 0
        TX packets 131532889938  bytes 194952543072346 (194.9 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.12  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:0c  txqueuelen 0  (Ethernet)
        RX packets 792708  bytes 1213789120 (1.2 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 889655  bytes 278028450 (278.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 4980505670  bytes 186223504814541 (186.2 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4980505670  bytes 186223504814541 (186.2 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Run started at:-
Fri Mar 24 03:18:49 UTC 2023
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-03-24 03:20:15,284] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
10-0-4-10:74872:74872 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
10-0-4-10:74872:74872 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-10:74872:74872 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.11<0>
10-0-4-10:74872:74872 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.3
10-0-4-10:74873:74873 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
10-0-4-11:62452:62452 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.12<0>
10-0-4-11:62453:62453 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.12<0>
10-0-4-10:74873:74873 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-11:62452:62452 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-11:62453:62453 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-11:62452:62452 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.12<0>
10-0-4-11:62452:62452 [0] NCCL INFO Using network IB
10-0-4-10:74873:74873 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.11<0>
10-0-4-10:74873:74873 [1] NCCL INFO Using network IB
10-0-4-11:62453:62453 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.12<0>
10-0-4-11:62453:62453 [1] NCCL INFO Using network IB
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74873:74971 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
10-0-4-10:74873:74971 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff
10-0-4-10:74872:74954 [0] NCCL INFO Channel 00/02 :    0   1   2   3
10-0-4-11:62452:62548 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
10-0-4-10:74872:74954 [0] NCCL INFO Channel 01/02 :    0   1   2   3
10-0-4-10:74872:74954 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
10-0-4-11:62453:62550 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
10-0-4-10:74872:74954 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 00 : 3[1e000] -> 0[1b000] [receive] via NET/IB/1
10-0-4-11:62452:62548 [0] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] [receive] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 01 : 3[1e000] -> 0[1b000] [receive] via NET/IB/1
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 00 : 0[1b000] -> 1[1c000] via direct shared memory
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 01 : 0[1b000] -> 1[1c000] via direct shared memory
10-0-4-11:62452:62548 [0] NCCL INFO Channel 01 : 1[1c000] -> 2[1d000] [receive] via NET/IB/1
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Channel 00 : 2[1d000] -> 3[1e000] via direct shared memory
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Channel 01 : 2[1d000] -> 3[1e000] via direct shared memory
10-0-4-10:74873:74971 [1] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] [send] via NET/IB/1
10-0-4-11:62453:62550 [1] NCCL INFO Channel 00 : 3[1e000] -> 0[1b000] [send] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Channel 01 : 1[1c000] -> 2[1d000] [send] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Connected all rings
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-11:62453:62550 [1] NCCL INFO Channel 01 : 3[1e000] -> 0[1b000] [send] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Channel 00 : 1[1c000] -> 0[1b000] via direct shared memory
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74873:74971 [1] NCCL INFO Channel 01 : 1[1c000] -> 0[1b000] via direct shared memory
10-0-4-11:62453:62550 [1] NCCL INFO Connected all rings
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Channel 00 : 3[1e000] -> 2[1d000] via direct shared memory
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Channel 01 : 3[1e000] -> 2[1d000] via direct shared memory

10-0-4-11:62452:62548 [0] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device
10-0-4-11:62452:62548 [0] NCCL INFO transport/net_ib.cc:415 -> 2

10-0-4-10:74872:74954 [0] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device
10-0-4-10:74872:74954 [0] NCCL INFO transport/net_ib.cc:415 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO transport/net_ib.cc:528 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO include/net.h:22 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO transport/net_ib.cc:528 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO include/net.h:22 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO transport/net.cc:234 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO transport.cc:119 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO transport/net.cc:234 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO transport.cc:119 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO init.cc:778 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO init.cc:778 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO init.cc:904 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO init.cc:904 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
10-0-4-11:62452:62548 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
[2023-03-24 03:20:35,665] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.13B parameters
Traceback (most recent call last):
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 102, in <module>
    main()
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 57, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_name, cache_dir=args.cache_dir)
Traceback (most recent call last):
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 652, in __init__
    self.model = LlamaModel(config)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 456, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 363, in wrapper
    self._post_init_method(module)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 754, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
    return func(*args, **kwargs)
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 102, in <module>
    main()
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 228, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 57, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_name, cache_dir=args.cache_dir)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 78, in broadcast
    return torch.distributed.broadcast(tensor=tensor,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 652, in __init__
    self.model = LlamaModel(config)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 456, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 363, in wrapper
    self._post_init_method(module)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 754, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
    return func(*args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 228, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 78, in broadcast
    return torch.distributed.broadcast(tensor=tensor,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.

example PyTorch failed

Sample: https://github.com/turingaicloud/quickstart/tree/master/example/PyTorch

Error message: Failed to run cmd " /mnt/home/my-name/.Miniconda3/bin/conda env create -f /mnt/home/my-name/WORKDIR/PyTorch/configurations/conda.yaml -n torch-env

# tcloud submit
Start parsing tuxiv.conf...
sending incremental file list
PyTorch/
PyTorch/run.sh
            192 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=6/9)
PyTorch/configurations/
PyTorch/configurations/citynet.sh
              0 100%    0.00kB/s    0:00:00 (xfr#3, to-chk=2/9)
PyTorch/configurations/conda.yaml
            189 100%   16.78kB/s    0:00:00 (xfr#4, to-chk=1/9)
PyTorch/configurations/run.slurm
            322 100%   28.59kB/s    0:00:00 (xfr#5, to-chk=0/9)

sent 496 bytes  received 145 bytes  427.33 bytes/sec
total size is 7,360  speedup is 11.48
Collecting package metadata (repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda



Downloading and Extracting Packages
typing_extensions-4. | 28 KB     | ##################################### | 100%
openjpeg-2.4.0       | 331 KB    | ##################################### | 100%
python-3.6.9         | 30.2 MB   | ##################################### | 100%
coverage-5.5         | 258 KB    | ##################################### | 100%
mkl_random-1.1.1     | 327 KB    | ##################################### | 100%
olefile-0.46         | 48 KB     | ##################################### | 100%
intel-openmp-2022.1. | 4.5 MB    | ##################################### | 100%
ninja-1.10.2         | 8 KB      | ##################################### | 100%
mkl_fft-1.3.0        | 170 KB    | ##################################### | 100%
libpng-1.6.39        | 304 KB    | ##################################### | 100%
jpeg-9b              | 214 KB    | ##################################### | 100%
mkl-2020.2           | 138.3 MB  | ##################################### | 100%
libprotobuf-3.17.2   | 2.0 MB    | ##################################### | 100%
certifi-2021.5.30    | 139 KB    | ##################################### | 100%
blas-1.0             | 6 KB      | ##################################### | 100%
zlib-1.2.13          | 103 KB    | ##################################### | 100%
werkzeug-2.0.3       | 221 KB    | ##################################### | 100%
libwebp-base-1.3.2   | 387 KB    | ##################################### | 100%
ffmpeg-4.3           | 9.9 MB    | ##################################### | 100%
lcms2-2.12           | 312 KB    | ##################################### | 100%
libiconv-1.16        | 736 KB    | ##################################### | 100%
dataclasses-0.8      | 22 KB     | ##################################### | 100%
pip-21.2.2           | 1.8 MB    | ##################################### | 100%
torchaudio-0.9.0     | 4.4 MB    | ##################################### | 100%
setuptools-58.0.4    | 788 KB    | ##################################### | 100%
libedit-3.1.20221030 | 181 KB    | ##################################### | 100%
tk-8.6.12            | 3.0 MB    | ##################################### | 100%
tensorboard-1.15.0   | 3.2 MB    | ##################################### | 100%
cudatoolkit-11.1.74  | 1.19 GB   | ####################################9 | 100% [Tcloud Error] 2023/11/24 tcloudcli.go:152: Failed to run cmd "  /mnt/home/my-name/.Miniconda3/bin/conda env create -f /mnt/home/my-name/WORKDIR/PyTorch/configurations/conda.yaml -n torch-env
 " wait: remote command exited without exit status or exit signal
[Tcloud Error] 2023/11/24 tcloudcli.go:414: Failed to run cmd in CondaCreate
[Tcloud Error] 2023/11/24 tcloudcli.go:347: Failed to create conda env

Dependencies Installation issue

Issue:

When calling tcloud init, some of the dependencies in tuxiv.conf cannot be installed. Adding the package's channel doesn't work either.

Reproduce:

  1. Packages from the conda-forge channel
    E.g.: opencv-python
  2. Some third-party dependent frameworks, such as fastai

Expected behavior:
Packages can be installed via tuxiv.conf

Current workaround:

Add a pip install command to the entrypoints so the package is installed when the job is submitted, or pull the GitHub repo and install the package directly from the job files.
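As a sketch, assuming the entrypoint section of tuxiv.conf accepts a list of shell commands (as in the multi-node issue above), and using the packages above as examples:

    entrypoint:
        - pip install opencv-python fastai   # install packages conda cannot resolve
        - python main.py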

One node can be allocated to only one user

Issue:
Even when a user has requested only 1 CPU or 1 GPU and more resources are available, other users cannot be allocated to the same node.

Reproduce:
In tuxiv.conf:

job:
    name: test
    general:
      - nodes=1
      - ntasks=1
      - cpus-per-task=1
      - gres=gpu:1

Expected behavior:
Node 1 has 39 CPUs and 7 GPUs available; other users should be able to use the remaining resources.

cuda not available with different torch version

Issue:
When creating a new environment with a PyTorch version > 1.9.1 and torchvision > 0.10.1, torch.cuda.is_available() returns False.

Reproduce:

environment:
  name: XXXXXXXX
  channels:
    - conda-forge
    - defaults
    - pytorch
  dependencies:
    - pytorch==1.10.1
    - torchvision==0.11.2
    - cudatoolkit=11.0 # or 11.3 / 11.2 / or cudatoolkit not explicitly set

Once the above environment is set, simply run a python script as following:

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # returns False
print(torch.version.cuda)         # prints None

Any PyTorch version higher than 1.9.1 produces the above issue, as does leaving cudatoolkit unset instead of pinning it to 11.0. The default cudatoolkit for conda environments after PyTorch 1.8.0 is 11.3.

Expected behavior:

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # should return True
print(torch.version.cuda)         # should print the CUDA version, e.g. 11.0

CUDA should be available with any PyTorch version; failing that, an error should be raised to tell the user that TACC does not support PyTorch versions higher than 1.9.1.

Possible cause of the issue

Please check the reference issue from pytorch (pytorch/pytorch#51858)

It seems to me that there is a driver mismatch on the TACC CUDA servers: nvcc reports 11.0, nvidia-smi reports 11.3, and the PyTorch cudatoolkit must match 11.0, so 1.9.1 is the last PyTorch build that runs on 11.0.

Current working build

environment:
  name: XXXXXXXX
  channels:
    - conda-forge
    - defaults
    - pytorch
  dependencies:
    - pytorch==1.9.1
    - torchvision==0.10.1
    - cudatoolkit=11.0

How to compile a code through make?

Hi, I would like to compile CUDA code with make and then run it, but I do not know how to do that through tcloud. In addition, all the examples are about Python, so I have to ask you for help. Could you give me some suggestions? Thanks very much.
