Comments (12)
colossalai check -i
------------ Environment ------------
Colossal-AI version: 0.3.4
PyTorch version: 2.1.2
System CUDA version: 12.3
CUDA version required by PyTorch: 12.1
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
- If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is give by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
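The "System and PyTorch CUDA version match" row in the report comes down to comparing two version strings. Here is a minimal sketch of that kind of comparison, assuming a match means the major and minor components are equal. The function name and the matching rule are illustrative assumptions, not Colossal-AI's actual implementation.

```python
def cuda_versions_match(system_cuda: str, torch_cuda: str) -> bool:
    """Return True when both versions share the same major.minor prefix.

    Hypothetical helper: the real `colossalai check -i` logic may differ.
    """
    def major_minor(version: str) -> tuple[int, int]:
        parts = version.strip().split(".")
        return int(parts[0]), int(parts[1])

    return major_minor(system_cuda) == major_minor(torch_cuda)


# Values taken from the report above.
print(cuda_versions_match("12.3", "12.1"))  # False
print(cuda_versions_match("12.1", "12.1"))  # True
```

On a real machine the second argument would be the value the report attributes to torch.version.cuda.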
Can you downgrade cuda to 12.1?
@flybird11111 Do you mean that my system CUDA version should be 12.1?
@flybird11111 I have downgraded CUDA to 12.1. Please see the following output:
colossalai check -i
Installation Report
------------ Environment ------------
Colossal-AI version: 0.3.4
PyTorch version: 2.1.2
System CUDA version: 12.1
CUDA version required by PyTorch: 12.1
Note:
- The table above checks the versions of the libraries/tools in the current environment
- If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
- If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is give by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
- AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
- If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A
Note:
- The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
But when I run the ResNet example:
ken@ken-pc:~/ColossalAI/examples/images/resnet$ colossalai run --nproc_per_node 1 train.py
I get the following error. Can you help me? Thank you.
Traceback (most recent call last):
File "/home/ken/ColossalAI/examples/images/resnet/train.py", line 16, in <module>
from colossalai.booster import Booster
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/__init__.py", line 2, in <module>
from .booster import Booster
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/booster.py", line 17, in <module>
from .plugin import Plugin
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/plugin/__init__.py", line 1, in <module>
from .gemini_plugin import GeminiPlugin
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/booster/plugin/gemini_plugin.py", line 29, in <module>
from colossalai.shardformer import ShardConfig, ShardFormer
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/__init__.py", line 1, in <module>
from .shard import ShardConfig, ShardFormer
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/shard/__init__.py", line 2, in <module>
from .sharder import ModelSharder
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/shard/sharder.py", line 10, in <module>
from ..policies.auto_policy import get_autopolicy
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/policies/auto_policy.py", line 6, in <module>
from .base_policy import Policy
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/policies/base_policy.py", line 14, in <module>
from ..layer.normalization import BaseLayerNorm
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/layer/__init__.py", line 2, in <module>
from .embedding import Embedding1D, VocabParallelEmbedding1D
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/layer/embedding.py", line 14, in <module>
from colossalai.nn import init as init
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/nn/__init__.py", line 5, in <module>
from .optimizer import *
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/nn/optimizer/__init__.py", line 1, in <module>
from .cpu_adam import CPUAdam
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/nn/optimizer/cpu_adam.py", line 7, in <module>
from colossalai.kernel.op_builder import ArmCPUAdamBuilder, CPUAdamBuilder
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/__init__.py", line 1, in <module>
from .cuda_native import FusedScaleMaskSoftmax, LayerNorm, MultiHeadAttention
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/__init__.py", line 1, in <module>
from .layer_norm import MixedFusedLayerNorm as LayerNorm
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 12, in <module>
from colossalai.kernel.op_builder.layernorm import LayerNormBuilder
ModuleNotFoundError: No module named 'colossalai.kernel.op_builder'
[2023-12-25 14:26:40,806] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3542) of binary: /home/ken/miniconda3/bin/python
Traceback (most recent call last):
File "/home/ken/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
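The first traceback above ends with ModuleNotFoundError for colossalai.kernel.op_builder, so the installed package appears to lack that subpackage in this environment. One way to confirm whether a given submodule can be located is sketched below with the standard library only. The helper name module_available is hypothetical and not part of Colossal-AI.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located.

    Note: find_spec imports the parent packages of a dotted name,
    but not the named module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted path is missing
        # or fails to import one of its own dependencies.
        return False


# Standard-library names are used here so the example runs anywhere.
print(module_available("json.decoder"))
print(module_available("json.no_such_submodule"))
```

In the environment from this thread, the name to check would be "colossalai.kernel.op_builder".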
@flybird11111 I did as you suggested (please see the following image)
and got a new error: 'ValueError: An IPv4 address cannot be in brackets'. Please see the following output. Thank you.
colossalai run --nproc_per_node 1 train.py
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/mha/flash_attn_2.py:28: UserWarning: please install flash_attn from https://github.com/HazyResearch/flash-attention
warnings.warn("please install flash_attn from https://github.com/HazyResearch/flash-attention")
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/kernel/cuda_native/mha/mem_eff_attn.py:15: UserWarning: please install xformers from https://github.com/facebookresearch/xformers
warnings.warn("please install xformers from https://github.com/facebookresearch/xformers")
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
Traceback (most recent call last):
File "/home/ken/ColossalAI/examples/images/resnet/train.py", line 207, in <module>
main()
File "/home/ken/ColossalAI/examples/images/resnet/train.py", line 131, in main
colossalai.launch_from_torch(config={})
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/initialize.py", line 172, in launch_from_torch
launch(
File "/home/ken/miniconda3/lib/python3.11/site-packages/colossalai/initialize.py", line 55, in launch
dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1138, in init_process_group
rendezvous_iterator = rendezvous(
^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 98, in rendezvous
return _rendezvous_helper(url, rank, world_size, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 60, in _rendezvous_helper
result = urlparse(url)
^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/urllib/parse.py", line 395, in urlparse
splitresult = urlsplit(url, scheme, allow_fragments)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/urllib/parse.py", line 500, in urlsplit
_check_bracketed_host(bracketed_host)
File "/home/ken/miniconda3/lib/python3.11/urllib/parse.py", line 448, in _check_bracketed_host
raise ValueError(f"An IPv4 address cannot be in brackets")
ValueError: An IPv4 address cannot be in brackets
[2023-12-26 10:18:02,536] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 512) of binary: /home/ken/miniconda3/bin/python
Traceback (most recent call last):
File "/home/ken/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
(base) ken@ken-pc:~/ColossalAI/examples/images/resnet$
run(args)
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ken/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-12-26_10:18:02
host : ken-pc.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 512)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 train.py on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
Hi, could you please post the command you ran?
@flybird11111
I changed to the resnet example directory and ran the following command:
colossalai run --nproc_per_node 1 train.py
Hi, it seems something is wrong with Python 3.11; we will fix it soon. Can you switch to Python 3.10? It works in my environment.
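The ValueError in the traceback earlier in this thread is raised by urllib.parse while parsing the rendezvous URL. That suggests the URL was built with the IPv4 address in brackets, although the URL itself is not shown in the log, so this is an assumption. Below is a minimal sketch of building such a URL so that only IPv6 literals get brackets. build_init_method is a hypothetical helper, not part of Colossal-AI or PyTorch.

```python
import ipaddress


def build_init_method(host: str, port: int) -> str:
    """Build a tcp:// URL, bracketing the host only if it is an IPv6 literal."""
    try:
        is_ipv6 = isinstance(ipaddress.ip_address(host), ipaddress.IPv6Address)
    except ValueError:
        # Not an IP literal (for example a hostname such as "localhost").
        is_ipv6 = False
    return f"tcp://[{host}]:{port}" if is_ipv6 else f"tcp://{host}:{port}"


# Address and port taken from the torchrun command in the log above.
print(build_init_method("127.0.0.1", 29500))  # tcp://127.0.0.1:29500
print(build_init_method("::1", 29500))        # tcp://[::1]:29500
```

Brackets in the host part of a URL are reserved for IPv6 literals, which is why the sketch leaves IPv4 addresses and hostnames bare.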
@flybird11111
I created and activated a Python 3.10 environment with "conda create -n py310 python=3.10" and "source activate py310".
Then I installed the library from source with pip.
I tried to run colossalai run --nproc_per_node 1 train.py in the resnet directory
and got the following error:
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
But I have installed the CUDA toolkit, and I can find CUDA in /usr/local/cuda (please see the following image).
I have also added the environment variables to ~/.bashrc.
How can I run the example? Thank you.
How about adding export PATH=/usr/local/cuda/bin:${PATH} to ~/.bashrc?
@flybird11111
I edited ~/.bashrc with vim and appended the exports at the bottom. Please see the end of the file below. Thank you.
if ! shopt -oq posix; then
if [ -f /usr/share/bash-completion/bash_completion ]; then
. /usr/share/bash-completion/bash_completion
elif [ -f /etc/bash_completion ]; then
. /etc/bash_completion
fi
fi
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ken/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/home/ken/miniconda3/etc/profile.d/conda.sh" ]; then
. "/home/ken/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/home/ken/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda
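To check that the exports above are actually visible to the process that launches training, the environment can be inspected from Python. This is a minimal sketch using only the standard library. describe_cuda_env is a hypothetical helper, and the keys it reports are chosen for illustration.

```python
import os
import shutil


def describe_cuda_env(env: dict[str, str]) -> dict[str, object]:
    """Summarise the CUDA-related settings found in an environment mapping."""
    cuda_home = env.get("CUDA_HOME")
    ld_paths = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    return {
        "cuda_home": cuda_home,
        # True only if CUDA_HOME is set and points at an existing directory.
        "cuda_home_exists": bool(cuda_home) and os.path.isdir(cuda_home),
        # Full path of the first nvcc found on PATH, or None if there is none.
        "nvcc_on_path": shutil.which("nvcc", path=env.get("PATH", "")),
        "ld_library_path": ld_paths,
    }


if __name__ == "__main__":
    for key, value in describe_cuda_env(dict(os.environ)).items():
        print(f"{key}: {value}")
```

Run it in a new shell with the py310 environment active, so that ~/.bashrc has been re-read. It only inspects environment variables; it does not tell whether the installed PyTorch was built with CUDA support, which is what the torch.version.cuda value mentioned in the report indicates.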