
Comments (7)

wangruohui commented on May 29, 2024

Hi @KevinNuNu ,

After fixing this bug (I simply use RANK), I tested on two nodes, each with 1x A100, connected via both InfiniBand and Ethernet.

Using the deepspeed command I can't run the program; it gets stuck somewhere even before the point in your screenshot. I think it is some problem on my side; I use pdsh, which is the default launcher.

But if I manually issue the command on both servers, it works. I can see the GPUs running on both servers, and the program outputs correct answers.

The command looks like:

python -m deepspeed.launcher.launch --world_info=eyJTSC1JREMxLTEwLTE0MC0yNC0xNDAiOiBbMF0sICJTSC1JREMxLTEwLTE0MC0yNC0xNDIiOiBbMF19 --node_rank=0 --master_addr=10.140.24.140 --master_port=29500 --module lmdeploy.pytorch.chat PATH_TO_INTERNLM

Note that node_rank is 0 and 1 on the two servers respectively, and world_info is a base64 encoding of a config. You can find it in the deepspeed log and substitute it into the command above.
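For reference, the world_info blob is just base64-encoded JSON mapping each hostname to its GPU slots. A minimal Python sketch to decode or rebuild one (the blob and hostnames are the ones from the command above; `node-a`/`node-b` below are hypothetical):

```python
import base64
import json

# Decode the blob copied from the deepspeed log (same one as in the command above)
blob = "eyJTSC1JREMxLTEwLTE0MC0yNC0xNDAiOiBbMF0sICJTSC1JREMxLTEwLTE0MC0yNC0xNDIiOiBbMF19"
world_info = json.loads(base64.b64decode(blob))
print(world_info)  # maps each hostname to its list of GPU indices

# Rebuild a blob for a different (hypothetical) host/GPU layout
new_info = {"node-a": [0], "node-b": [0]}
new_blob = base64.b64encode(json.dumps(new_info).encode()).decode()
```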

The prompt is printed on my master side and I can interact with it.

from lmdeploy.

wangruohui commented on May 29, 2024

After some testing I can reproduce your bug. I think the problem lies in pdsh's inability to handle stdin. The process is actually waiting for user input inside an SSH session from master to master, instead of the shell session you are interacting with (on the master).

To verify this, you can find the PID of the program (via nvidia-smi; it is 233507 in the following command) and directly send input to its stdin via echo -e "自我介绍一下吧\n\n" > /proc/233507/fd/0 (the prompt means "introduce yourself"). The model will respond and print its output to the shell from which you launched deepspeed.

So I think you can either launch the workers manually on the two servers (as above) or try another launcher that is both supported by deepspeed and capable of forwarding stdin. (But I don't know which one actually qualifies 😜)
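The same procfs trick, wrapped in Python (a Linux-only sketch; `send_stdin` is just an illustrative helper name, not part of deepspeed or lmdeploy):

```python
def send_stdin(pid: int, text: str) -> None:
    """Write text into another process's stdin via procfs (Linux only).

    This mirrors `echo -e "..." > /proc/<pid>/fd/0`: opening
    /proc/<pid>/fd/0 for writing reopens whatever object that process
    has as stdin (here, the pipe the pdsh/SSH session gave it), so the
    bytes land in the target's input stream rather than ours.
    """
    with open(f"/proc/{pid}/fd/0", "w") as f:
        f.write(text)
```

Usage would be `send_stdin(233507, "自我介绍一下吧\n\n")`, with the PID taken from nvidia-smi as above.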

Reference: https://unix.stackexchange.com/questions/544052/pdsh-input-from-file-possible


wangruohui commented on May 29, 2024

I don't have 2 nodes on hand right now to test the issue. From your screenshot, it looks like >>> is not properly output. Is 30.160.40.182 (which prints master) the node you are interacting with? I suspect that the prompt is output to another server.

_on_master = local_rank == 0 should be equivalent to _on_master = get_rank() == 0, so changing this may not affect things.

Moreover, I noticed that you are using Ethernet between the two nodes. For tensor parallelism, that should be very slow (1-10 Gb/s, compared to NVLink's 300 GB/s bandwidth).
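As a back-of-the-envelope illustration of that gap (the 1 GB payload is purely hypothetical; real all-reduce volume depends on the model, but the link speeds are the ones cited above):

```python
# Rough transfer-time comparison, not a benchmark.
GB = 1e9
ethernet_bps = 10e9 / 8   # 10 Gb/s Ethernet -> 1.25 GB/s
nvlink_bps = 300 * GB     # 300 GB/s NVLink

payload = 1 * GB          # hypothetical tensor-parallel traffic per step
t_eth = payload / ethernet_bps
t_nvl = payload / nvlink_bps
print(f"ethernet: {t_eth:.3f}s  nvlink: {t_nvl:.5f}s  ratio: {t_eth / t_nvl:.0f}x")
```

At 10 Gb/s, every gigabyte of traffic costs roughly 0.8 s of pure wire time, versus a few milliseconds over NVLink, so the interconnect dominates step time.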


KevinNuNu commented on May 29, 2024

I don't have 2 nodes on hand right now to test the issue. From your screenshot, it looks like >>> is not properly output. Is 30.160.40.182 (which prints master) the node you are interacting with? I suspect that the prompt is output to another server.

_on_master = local_rank == 0 should be equivalent to _on_master = get_rank() == 0, so changing this may not affect things.

Moreover, I noticed that you are using Ethernet between the two nodes. For tensor parallelism, that should be very slow (1-10 Gb/s, compared to NVLink's 300 GB/s bandwidth).

Yep, 30.160.40.182 is the master node I am interacting with.

Actually, _on_master = local_rank == 0 is not equivalent to _on_master = get_rank() == 0: the local_rank value of the first GPU on each node is 0, so I think it's a bug.

[screenshot]

Yes, it's slow, but the warmup completes, which proves the tensor-parallel forward pass works.
My problem is that sometimes even the input prompt "double enter to end input >>>" is not displayed.
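The distinction in one sketch (standard torch.distributed-style numbering assumed): LOCAL_RANK restarts at 0 on every node, while RANK is global across all nodes.

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """RANK as derived from the node index and the within-node index."""
    return node_rank * gpus_per_node + local_rank

# 2 nodes x 1 GPU: the first GPU of EACH node has LOCAL_RANK 0,
# so a `local_rank == 0` check is true on both nodes -- the bug above.
assert global_rank(node_rank=0, local_rank=0, gpus_per_node=1) == 0
assert global_rank(node_rank=1, local_rank=0, gpus_per_node=1) == 1
```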


wangruohui commented on May 29, 2024

I think we need RANK instead of LOCAL_RANK.


KevinNuNu commented on May 29, 2024

I think we need RANK instead of LOCAL_RANK.

Yep, I have used mmengine.dist.get_rank() instead, but it's still stuck. Can you test it on A100 later?


lvhan028 commented on May 29, 2024

Closing since there has been no activity for two weeks. Feel free to reopen it if it is still an issue.

