
Comments (9)

delock commented on July 19, 2024

Hi @daehuikim, I use the following commands and can see the CUDA op status. Note that I don't have the CUDA toolchain installed. If your environment does have the CUDA toolchain, you should be able to see the desired result on your master node.

DS_ACCELERATOR=cuda DS_BUILD_FUSED_ADAM=1 pip install deepspeed
DS_ACCELERATOR=cuda ds_report

You may want to set DS_ACCELERATOR=cuda in your .bashrc if you wish to build for CUDA by default on the master node. You don't need this env var on the compute nodes, but it will work there as well. See the sketch below.
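For reference, a minimal sketch of what that could look like in ~/.bashrc on the master node; the DS_BUILD_FUSED_ADAM line is optional and only matters at pip install time:

# Make DeepSpeed build/report against CUDA by default, even without a GPU attached
export DS_ACCELERATOR=cuda
# Optional: pre-build the fused Adam op when installing DeepSpeed
export DS_BUILD_FUSED_ADAM=1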

(dscpu) 22:07:19|~/machine_learning/DeepSpeed$ DS_ACCELERATOR=cuda ds_report
[2024-05-27 22:07:36,330] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (override)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/akey/anaconda3/envs/dscpu/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/akey/anaconda3/envs/dscpu/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.2, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version .....................  [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None 
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 31.18 GB


loadams commented on July 19, 2024

Hi @daehuikim - are you able to run pip install deepspeed with no errors? And do you hit any errors when installing other ops?

It appears that your system is being detected as CPU-only, but you have installed torch+cuda. Can you tell us more about which accelerator you are trying to use?

[2024-05-21 11:34:41,285] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.


daehuikim commented on July 19, 2024

Hello @loadams, thanks for your reply.
pip install deepspeed works without any errors for me.
I am running my script from my master node, which is CPU-only, using the Slurm scheduler.
Specifically, I activate a conda virtual environment that has the packages, and Slurm propagates the work to worker nodes that have multiple GPUs.
Therefore, I am trying to install DeepSpeed with the ops pre-built in my conda virtual environment.


loadams commented on July 19, 2024

I see. Is there a reason that you need to precompile the ops? You should be able to run DeepSpeed directly on the GPU nodes, where it will detect the GPU and then JIT-compile the ops (information here); see the sketch below.
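For context, a minimal sketch of that flow on a GPU compute node. The script name train.py and the GPU count are hypothetical placeholders, and the script-level arguments depend on how your training script wires up DeepSpeed:

# Plain install on the GPU node: no DS_BUILD_* flags needed
pip install deepspeed
# Ops such as fused_adam are JIT-compiled the first time they are used
deepspeed --num_gpus 4 train.py --deepspeed --deepspeed_config ds_config.json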


daehuikim commented on July 19, 2024

@loadams
There is no particular reason for doing this.
I was just following this tutorial about finetuning a T5 model.
I have now found another workaround: adding
torch_adam=true
to the optimizer params in the DeepSpeed config (this falls back to torch's Adam implementation; a sketch follows below).
I just wanted to let the contributors know that this (a failing pre-build installation in some environments) happens.
Thanks for replying!
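For reference, a minimal sketch of that config workaround. The optimizer type and learning rate are illustrative and other required config fields (e.g. batch size) are omitted; the torch_adam flag is the point:

# Illustrative only: with torch_adam set, DeepSpeed uses torch's Adam,
# so the fused CUDA Adam op does not need to be pre-built
cat > ds_config.json <<'EOF'
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0001,
      "torch_adam": true
    }
  }
}
EOF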


loadams commented on July 19, 2024

Thanks @daehuikim - that makes sense. Since DeepSpeed currently detects your master node as a CPU-only environment, it believes it can only run the ops that work there. Can you try the following? It may not work since you don't have CUDA installed on the node, but if you do, you can specify the DeepSpeed accelerator to build for by adding the DS_ACCELERATOR=cuda env var before your pip install command.


daehuikim commented on July 19, 2024

pip uninstall deepspeed
DS_ACCELERATOR=cuda pip install deepspeed
ds_report

produces a result like the one below:

[2024-05-24 09:17:18,762] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-05-24 09:17:18,763] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['TORCH_INSTALL_PATH']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['DEEPSPEED_INSTALL_PATH']
deepspeed info ................... 0.14.2+cu118torch2.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 2.0
shared memory (/dev/shm) size .... 125.67 GB

@loadams I tried the recommended variable and got the same result.


daehuikim commented on July 19, 2024

DS_ACCELERATOR=cuda ds_report

@delock Your recommendation made everything work perfectly! Thanks for the nice advice.
I got the same results as you. Thanks again :)


loadams commented on July 19, 2024

Thanks for clarifying the env var use, @delock!

