
Comments (12)

matrixoneken commented on June 2, 2024

colossalai check -i

------------ Environment ------------
Colossal-AI version: 0.3.4
PyTorch version: 2.1.2
System CUDA version: 12.3
CUDA version required by PyTorch: 12.1

Note:

  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
  3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A

Note:

  1. The table above checks the version compatibility of the libraries/tools in the current environment
    • PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
    • System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
    • System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
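
The check above compares the CUDA toolkit found on the system against the CUDA runtime PyTorch was built with. A quick way to look at both values directly, outside the colossalai CLI (a minimal sketch using standard tools):

# toolkit visible to the shell (what "System CUDA version" reports)
nvcc --version
# runtime PyTorch was compiled against (torch.version.cuda)
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
# if nvcc is not found, point CUDA_HOME at the toolkit install, e.g.
export CUDA_HOME=/usr/local/cuda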

flybird11111 commented on June 2, 2024

Can you downgrade CUDA to 12.1?
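
For reference, "downgrading" here usually means installing the CUDA 12.1 toolkit alongside the existing one and pointing the environment at it. A rough sketch, assuming the 12.1 toolkit lands under /usr/local/cuda-12.1:

# after installing the CUDA 12.1 toolkit from NVIDIA
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version   # should now report 12.1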

matrixoneken commented on June 2, 2024

@flybird11111 Do you mean my system CUDA version should be 12.1?

matrixoneken commented on June 2, 2024

@flybird11111 I have downgraded CUDA to 12.1. Please see the following:

colossalai check -i

Installation Report

------------ Environment ------------
Colossal-AI version: 0.3.4
PyTorch version: 2.1.2
System CUDA version: 12.1
CUDA version required by PyTorch: 12.1

Note:

  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
  3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

Note:

  1. The table above checks the version compatibility of the libraries/tools in the current environment
    • PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
    • System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
    • System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

But when I run the ResNet example:

ken@ken-pc:~/ColossalAI/examples/images/resnet$ colossalai run --nproc_per_node 1 train.py

I still get the following error. Can you help me? Thank you.

Traceback (most recent call last):
File "/home/ken/ColossalAI/examples/images/resnet/train.py", line 16, in <module>
from colossalai.booster import Booster
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/__init__.py", line 2, in <module>
from .booster import Booster
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/booster.py", line 17, in <module>
from .plugin import Plugin
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/plugin/__init__.py", line 1, in <module>
from .gemini_plugin import GeminiPlugin
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/plugin/gemini_plugin.py", line 29, in <module>
from colossalai.shardformer import ShardConfig, ShardFormer
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/__init__.py", line 1, in <module>
from .shard import ShardConfig, ShardFormer
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/shard/__init__.py", line 2, in <module>
from .sharder import ModelSharder
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/shard/sharder.py", line 10, in <module>
from ..policies.auto_policy import get_autopolicy
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/policies/auto_policy.py", line 6, in <module>
from .base_policy import Policy
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/policies/base_policy.py", line 14, in <module>
from ..layer.normalization import BaseLayerNorm
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/layer/__init__.py", line 2, in <module>
from .embedding import Embedding1D, VocabParallelEmbedding1D
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/layer/embedding.py", line 14, in <module>
from colossalai.nn import init as init
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/nn/__init__.py", line 5, in <module>
from .optimizer import *
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/nn/optimizer/__init__.py", line 1, in <module>
from .cpu_adam import CPUAdam
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/nn/optimizer/cpu_adam.py", line 7, in <module>
from colossalai.kernel.op_builder import ArmCPUAdamBuilder, CPUAdamBuilder
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/__init__.py", line 1, in <module>
from .cuda_native import FusedScaleMaskSoftmax, LayerNorm, MultiHeadAttention
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/__init__.py", line 1, in <module>
from .layer_norm import MixedFusedLayerNorm as LayerNorm
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 12, in <module>
from colossalai.kernel.op_builder.layernorm import LayerNormBuilder
ModuleNotFoundError: No module named 'colossalai.kernel.op_builder'
[2023-12-25 14:26:40,806] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3542) of binary: /home/ken/miniconda3/bin/python
Traceback (most recent call last):
File "/home/ken/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>
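
The failing import is colossalai.kernel.op_builder. One way to confirm whether that subpackage actually exists in the installed copy (a diagnostic sketch, assuming the same environment as in the traceback):

# locate the installed package
python -c "import colossalai, os; print(os.path.dirname(colossalai.__file__))"
# then list its kernel directory and look for op_builder
ls "$(python -c 'import colossalai, os; print(os.path.dirname(colossalai.__file__))')/kernel"

If op_builder is missing there, the installation itself is incomplete, which matches the suggestion below to reinstall from the main branch.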

flybird11111 commented on June 2, 2024
[screenshot] I haven't been able to reproduce your issue. It looks like you are missing some files. Can you try installing Colossal-AI from the main branch of the repository? See https://colossalai.org/docs/get_started/installation.
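
A sketch of what installing from the main branch could look like, roughly following the installation page linked above; the CUDA_EXT=1 prefix is optional and corresponds to the AOT note in the check output:

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
pip uninstall -y colossalai        # remove the existing copy first
CUDA_EXT=1 pip install .           # or plain "pip install ." to skip AOT kernel builds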

matrixoneken commented on June 2, 2024

@flybird11111 I have done as you suggested; please see the following screenshot.
[screenshot]

I now get a new error, 'ValueError: An IPv4 address cannot be in brackets'. Please see the following message. Thank you.

colossalai run --nproc_per_node 1 train.py
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
warnings.warn("config is deprecated and will be removed soon.")
Traceback (most recent call last):
File "/home/ken/ColossalAI/examples/images/resnet/train.py", line 207, in <module>
main()
File "/home/ken/ColossalAI/examples/images/resnet/train.py", line 131, in main
colossalai.launch_from_torch(config={})
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/initialize.py", line 172, in launch_from_torch
launch(
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/initialize.py", line 55, in launch
dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1138, in init_process_group
rendezvous_iterator = rendezvous(
^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 98, in rendezvous
return _rendezvous_helper(url, rank, world_size, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 60, in _rendezvous_helper
result = urlparse(url)
^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/urllib/parse.py", line 395, in urlparse
splitresult = urlsplit(url, scheme, allow_fragments)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/urllib/parse.py", line 500, in urlsplit
_check_bracketed_host(bracketed_host)
File "/home/ken/miniconda3/lib/python3.11/urllib/parse.py", line 448, in _check_bracketed_host
raise ValueError(f"An IPv4 address cannot be in brackets")
ValueError: An IPv4 address cannot be in brackets
[2023-12-26 10:18:02,536] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 512) of binary: /home/ken/miniconda3/bin/python
Traceback (most recent call last):
File "/home/ken/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-12-26_10:18:02
host : ken-pc.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 512)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 train.py on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
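
For reference, the last line shows the exact torchrun command that colossalai run wraps; running it directly reproduces the same failure and can make the underlying error easier to isolate:

torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 train.py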

flybird11111 commented on June 2, 2024

Hi, could you please post the command you ran?

matrixoneken commented on June 2, 2024

@flybird11111
I cd into the ResNet example directory and run the following command:
colossalai run --nproc_per_node 1 train.py

flybird11111 commented on June 2, 2024

Hi, it seems there is something wrong with Python 3.11; we will fix it soon. Can you change the Python version to 3.10? It works in my environment.
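
The traceback bottoms out in urllib.parse, which since Python 3.11 rejects IPv4 addresses wrapped in brackets. A minimal reproduction of that library behavior, independent of Colossal-AI:

python -c "from urllib.parse import urlsplit; urlsplit('tcp://[127.0.0.1]:29500')"
# Python 3.11: ValueError: An IPv4 address cannot be in brackets
# Python 3.10: parses fine, which is why switching interpreters works around it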

matrixoneken commented on June 2, 2024

@flybird11111
I use "conda create -n py310 python=3.10" and "source activate py310"
Then I use pip install the lib from source code
I try to run colossalai run --nproc_per_node 1 train.py in reset dir
And I get following error:
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

But I have installed the cude toolkit, I can find cuda in /usr/local/cuda,please see following image
image

And I have add env in ~/.bashrc
image
How can I run the example? Thank you

flybird11111 commented on June 2, 2024

How about adding export PATH=/usr/local/cuda/bin:${PATH} to your ~/.bashrc?
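
To apply that in the current shell and confirm the toolkit is picked up before re-running (a quick check, not specific to Colossal-AI):

export PATH=/usr/local/cuda/bin:${PATH}
which nvcc        # expect /usr/local/cuda/bin/nvcc
nvcc --version    # should report the installed toolkit version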

matrixoneken commented on June 2, 2024

@flybird11111
Using vim ~/.bashrc, I appended the following at the bottom. Please see the text below. Thank you

if ! shopt -oq posix; then
  if [ -f /usr/share/bash-completion/bash_completion ]; then
    . /usr/share/bash-completion/bash_completion
  elif [ -f /etc/bash_completion ]; then
    . /etc/bash_completion
  fi
fi

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ken/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/ken/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/ken/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/ken/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

export LD_LIBRARY_PATH=/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda
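
These lines only take effect in new shells (or after source ~/.bashrc). A short verification that both the toolkit and PyTorch can see the GPU, assuming the py310 environment is active:

source ~/.bashrc
echo $CUDA_HOME                      # expect /usr/local/cuda
which nvcc && nvcc --version
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"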
