
Comments (7)

wangruohui commented on May 29, 2024

Hi @KevinNuNu ,

After fixing this bug (I simply use RANK), I tested on two nodes, each with 1x A100, connected via both InfiniBand and Ethernet.

Using the deepspeed command I can't run the program; it gets stuck somewhere even before the point in your screenshot. I think it is some problem on my side; I use pdsh, which is the default launcher.

But if I manually issue the command on both servers, it works. I can see the GPUs running on both servers, and the program outputs correct answers.

The command looks like:

python -m deepspeed.launcher.launch --world_info=eyJTSC1JREMxLTEwLTE0MC0yNC0xNDAiOiBbMF0sICJTSC1JREMxLTEwLTE0MC0yNC0xNDIiOiBbMF19 --node_rank=0 --master_addr=10.140.24.140 --master_port=29500 --module lmdeploy.pytorch.chat PATH_TO_INTERNLM

Note that node_rank is 0 and 1 on the two servers respectively, and world_info is a base64 encoding of a config. You can find it in the deepspeed log and substitute it into the command above.
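For reference, the world_info blob is just base64-encoded JSON mapping each hostname to its GPU slots. A minimal Python sketch to decode or rebuild one (the blob and hostnames are the ones from the command above; `node-a`/`node-b` below are hypothetical):

```python
import base64
import json

# Decode the blob copied from the deepspeed log (same one as in the command above)
blob = "eyJTSC1JREMxLTEwLTE0MC0yNC0xNDAiOiBbMF0sICJTSC1JREMxLTEwLTE0MC0yNC0xNDIiOiBbMF19"
world_info = json.loads(base64.b64decode(blob))
print(world_info)  # maps each hostname to its list of GPU indices

# Rebuild a blob for a different (hypothetical) host/GPU layout
new_info = {"node-a": [0], "node-b": [0]}
new_blob = base64.b64encode(json.dumps(new_info).encode()).decode()
```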

The prompt is printed on my master side and I can interact with it.

from lmdeploy.

wangruohui commented on May 29, 2024

After some testing I can reproduce your bug. I think the problem lies in pdsh's inability to handle stdin. The process is actually waiting for user input inside an SSH session from master to master, instead of the shell session you are interacting with (on the master).

To verify this, you can find the PID of the program (via nvidia-smi; it is 233507 in the following command) and directly send input to its stdin via echo -e "自我介绍一下吧\n\n" > /proc/233507/fd/0 (the prompt means "introduce yourself"). The model will respond and print its output to the shell from which you launched deepspeed.

So I think you can either launch the workers manually on the two servers (as above) or try another launcher that is both supported by deepspeed and capable of forwarding stdin. (But I don't know which one actually qualifies 😜)
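The same procfs trick, wrapped in Python (a Linux-only sketch; `send_stdin` is just an illustrative helper name, not part of deepspeed or lmdeploy):

```python
def send_stdin(pid: int, text: str) -> None:
    """Write text into another process's stdin via procfs (Linux only).

    This mirrors `echo -e "..." > /proc/<pid>/fd/0`: opening
    /proc/<pid>/fd/0 for writing reopens whatever object that process
    has as stdin (here, the pipe the pdsh/SSH session gave it), so the
    bytes land in the target's input stream rather than ours.
    """
    with open(f"/proc/{pid}/fd/0", "w") as f:
        f.write(text)
```

Usage would be `send_stdin(233507, "自我介绍一下吧\n\n")`, with the PID taken from nvidia-smi as above.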

Reference: https://unix.stackexchange.com/questions/544052/pdsh-input-from-file-possible


wangruohui commented on May 29, 2024

I don't have 2 nodes on hand right now to test the issue. From your screenshot, it looks like >>> is not properly output. Is 30.160.40.182 (which prints master) the node you are interacting with? I suspect that the prompt is output to another server.

_on_master = local_rank == 0 should be equivalent to _on_master = get_rank() == 0, so changing this may not affect things.

Moreover, I noticed that you are using Ethernet between the two nodes. For tensor parallelism, that should be very slow (1-10 Gb/s, compared to NVLink's 300 GB/s bandwidth).
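As a back-of-the-envelope illustration of that gap (the 1 GB payload is purely hypothetical; real all-reduce volume depends on the model, but the link speeds are the ones cited above):

```python
# Rough transfer-time comparison, not a benchmark.
GB = 1e9
ethernet_bps = 10e9 / 8   # 10 Gb/s Ethernet -> 1.25 GB/s
nvlink_bps = 300 * GB     # 300 GB/s NVLink

payload = 1 * GB          # hypothetical tensor-parallel traffic per step
t_eth = payload / ethernet_bps
t_nvl = payload / nvlink_bps
print(f"ethernet: {t_eth:.3f}s  nvlink: {t_nvl:.5f}s  ratio: {t_eth / t_nvl:.0f}x")
```

At 10 Gb/s, every gigabyte of traffic costs roughly 0.8 s of pure wire time, versus a few milliseconds over NVLink, so the interconnect dominates step time.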


KevinNuNu commented on May 29, 2024

I don't have 2 nodes on hand right now to test the issue. From your screenshot, it looks like >>> is not properly output. Is 30.160.40.182 (which prints master) the node you are interacting with? I suspect that the prompt is output to another server.

_on_master = local_rank == 0 should be equivalent to _on_master = get_rank() == 0, so changing this may not affect things.

Moreover, I noticed that you are using Ethernet between the two nodes. For tensor parallelism, that should be very slow (1-10 Gb/s, compared to NVLink's 300 GB/s bandwidth).

Yep, 30.160.40.182 is the master node I am interacting with.

Actually, _on_master = local_rank == 0 is not equivalent to _on_master = get_rank() == 0: the local_rank value of the first GPU on each node is 0, so I think it's a bug.

[screenshot]

Yes, it's slow, but the warmup completes, which proves the tensor-parallel forward pass works.
My problem is that sometimes even the input prompt "double enter to end input >>>" is not displayed.
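The distinction in one sketch (standard torch.distributed-style numbering assumed): LOCAL_RANK restarts at 0 on every node, while RANK is global across all nodes.

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """RANK as derived from the node index and the within-node index."""
    return node_rank * gpus_per_node + local_rank

# 2 nodes x 1 GPU: the first GPU of EACH node has LOCAL_RANK 0,
# so a `local_rank == 0` check is true on both nodes -- the bug above.
assert global_rank(node_rank=0, local_rank=0, gpus_per_node=1) == 0
assert global_rank(node_rank=1, local_rank=0, gpus_per_node=1) == 1
```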


wangruohui commented on May 29, 2024

I think we need RANK instead of LOCAL_RANK.


KevinNuNu commented on May 29, 2024

I think we need RANK instead of LOCAL_RANK.

Yep, I have used mmengine.dist.get_rank() instead, but it's still stuck. Can you test it on A100 later?


lvhan028 commented on May 29, 2024

Closing since there has been no activity for two weeks. Feel free to reopen it if it is still an issue.

