Comments (25)

SmorkalovME commented on September 16, 2024

Hi Zbigniew,

Could you please check if you observe the same issue when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" variable instead of using pbs/rsh?
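In job-script terms, the suggestion amounts to something like this (a minimal sketch; the mpirun arguments and the ./mlsl_test binary mirror the ones used later in this thread):

```shell
# Force the Hydra process manager to bootstrap remote proxies over ssh
# rather than the PBS rsh wrapper (pbs_tmrsh).
export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -n 2 -ppn 1 ./mlsl_test 1
```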

Thanks,
Mikhail

from mlsl.

zj88 commented on September 16, 2024

Hi Mikhail,

Thanks for your reply. When I set "I_MPI_HYDRA_BOOTSTRAP=ssh" I get:

usage: /opt/pbs/default/bin/pbs_tmrsh [-n][-l username] host [-n][-l username] command
/opt/pbs/default/bin/pbs_tmrsh --version
=>> PBS: job killed: walltime 78 exceeded limit 60
[mpiexec@r1i4n32] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@r1i4n32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@r1i4n32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@r1i4n32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r1i4n32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r1i4n32] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

I can see some pbs_tmrsh commands in the output:

Proxy launch args: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id

[mpiexec@r1i4n32] Launch arguments: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 0

[mpiexec@r1i4n32] Launch arguments: /opt/pbs/default/bin/pbs_tmrsh -x -q r1i4n33 <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 1

SmorkalovME commented on September 16, 2024

Thanks Zbigniew. Could you please collect the output with the I_MPI_HYDRA_DEBUG=1 environment variable set and your initial launch approach (I_MPI_HYDRA_BOOTSTRAP=rsh)? Please note that this debug output may contain all the env variables set in your session, so if you have anything private there, please remove it.
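In shell terms, that request is roughly the following (a sketch; the run line and output file name are placeholders):

```shell
# Reproduce the original failure with the rsh bootstrap, but with
# Hydra's debug tracing enabled; capture both streams for the log.
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_DEBUG=1
mpirun -n 2 -ppn 1 ./mlsl_test 1 > mlsl_debug.txt 2>&1
```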

zj88 commented on September 16, 2024

I replaced some info with variables <...>. Hope it's readable.

mlsl_debug.txt

SmorkalovME commented on September 16, 2024

Thanks - it does help! Could you please try setting "MLSL_HOSTNAME_TYPE=1" in addition to what you already have in your env? If this doesn't help, please recollect the debug output with I_MPI_HYDRA_DEBUG=1 and MLSL_HOSTNAME_TYPE=1.
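That is, on top of your current environment (sketch):

```shell
# Add the MLSL hostname-resolution override to the existing environment.
export MLSL_HOSTNAME_TYPE=1
# If the failure persists, recollect the log with Hydra debug tracing on:
export I_MPI_HYDRA_DEBUG=1
```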

zj88 commented on September 16, 2024

Sure, here you are:
mlsl_debug.txt

At least we've got a different error, and it's no longer timing out:
Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x73f5c0)
PMPI_Ibcast(989).: Invalid communicator
Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x74a380)
PMPI_Ibcast(989).: Invalid communicator
[proxy:1:1@r1i4n33] HYD_pmcd_pmip_control_cmd_cb (../../pm/pmiserv/pmip_cb.c:3481): assert (!closed) failed
[proxy:1:1@r1i4n33] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:1:1@r1i4n33] main (../../pm/pmiserv/pmip.c:558): demux engine error waiting for event

By the way, I had this error before, when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" and also "I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh" (MLSL_HOSTNAME_TYPE unset).

zj88 commented on September 16, 2024

Actually, setting MLSL_HOSTNAME_TYPE=2 seems to work! No error returned. Could you please check if this is the correct output:
mlsl_debug.txt

VinnitskiV commented on September 16, 2024

Hi Zbigniew,

Unfortunately, MLSL_HOSTNAME_TYPE=2 doesn't work without the MLSL_IFACE_IDX or MLSL_IFACE_NAME env variables.

Could you please collect the full output (out and err; if you have anything private there, please remove it) with I_MPI_HYDRA_DEBUG=1, MLSL_LOG_LEVEL=5 and MLSL_HOSTNAME_TYPE=1? Also, add "-l" to your mpirun command line, like this:

  • mpirun -n 2 -ppn 1 -l -hosts ***.
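Put together, the requested collection run looks roughly like this (a sketch; the host list is elided as in the command above, and the test binary and output file names are placeholders):

```shell
export I_MPI_HYDRA_DEBUG=1
export MLSL_LOG_LEVEL=5
export MLSL_HOSTNAME_TYPE=1
# -l prefixes each output line with the MPI rank that produced it.
mpirun -n 2 -ppn 1 -l -hosts *** ./mlsl_test 1 > out.txt 2> err.txt
```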

--
BR,
Vladimir

zj88 commented on September 16, 2024

out:
mlsl_debug.txt

err:
[1] Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
[1] PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x73f680)
[1] PMPI_Ibcast(989).: Invalid communicator
[0] Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
[0] PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x74a3c0)
[0] PMPI_Ibcast(989).: Invalid communicator
[proxy:1:1@r1i0n3] HYD_pmcd_pmip_control_cmd_cb (../../pm/pmiserv/pmip_cb.c:3481): assert (!closed) failed
[proxy:1:1@r1i0n3] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:1:1@r1i0n3] main (../../pm/pmiserv/pmip.c:558): demux engine error waiting for event

VinnitskiV commented on September 16, 2024

Thanks, Zbigniew,

Could you please try to reproduce this issue with the latest Intel mlsl version: Intel(R) MLSL 2018 Update 1 Preview?

--
BR,
Vladimir

zj88 commented on September 16, 2024

Hi Vladimir,

It seems to be working well with the latest mlsl, but please take a look:
mlsl_debug.txt
The error file is empty; the mlsl version used is mlsl_2018.1.005.

VinnitskiV commented on September 16, 2024

Yes, it's working correctly.

So, for future runs you should set MLSL_HOSTNAME_TYPE=1.

If you don't have any other questions, we will close this issue.

--
BR,
Vladimir

zj88 commented on September 16, 2024

Yes, you can close it. Thanks Mikhail and Vladimir for your help!

SmorkalovME commented on September 16, 2024

Thanks Zbigniew. Please let us know in case you face any further issues.

zj88 commented on September 16, 2024

Hi again,

Unfortunately, I am having a similar issue while running on 16 nodes, this time when setting MLSL_NUM_SERVERS>3. So this config runs correctly:

#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=61
export MLSL_NUM_SERVERS=3
export MLSL_SERVER_AFFINITY="61,62,63"
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1

while this one times out:

#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=60
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY="60,61,62,63"
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1

Do you have any ideas what could be wrong here? I'm attaching output files:
mlsl_debug-success.txt
mlsl_debug-fail.txt
The mlsl version used is mlsl_2018.1.005.

Also, running on 8 nodes with 4 mlsl servers is successful.

VinnitskiV commented on September 16, 2024

@zj88
Hello, could you please collect the results with a debug build:

make clean && make ENABLE_DEBUG=1 && make install

zj88 commented on September 16, 2024

Thanks for the answer. I have built mlsl in debug mode; here is the output:
mlsl_output.txt

I had this config:
#PBS -l select=16:ncpus=64:mpiprocs=1
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
export OMP_NUM_THREADS=60
export KMP_HW_SUBSET=1t
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY=6,7,8,9
export KMP_AFFINITY="proclist=[0-5,10-63],granularity=thread,explicit"

VinnitskiV commented on September 16, 2024

@zj88
Thank you, it looks like the problem is the small walltime.
Could you try using 5 minutes?
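That would be a change to the walltime request in the PBS job script, e.g. (a sketch, reusing the select line from your script):

```shell
#PBS -l select=16:ncpus=64:mpiprocs=1
#PBS -l walltime=00:05:00
```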

zj88 commented on September 16, 2024

Sorry for the delay, the machine was busy.
Unfortunately the output is almost the same for 5 min:
mlsl_output.txt
I'm not sure what could be wrong here.

VinnitskiV commented on September 16, 2024

@zj88
Could you run with MLSL_SERVER_CREATION_TYPE=1?

zj88 commented on September 16, 2024

Similar result:
mlsl_output.txt
I also tried 10 min, but it ends with the same error code and almost the same output.

VinnitskiV commented on September 16, 2024

@zj88 Thank you,
Can you use ssh instead of rsh? Like this:
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=

zj88 commented on September 16, 2024

Please take a look:
mlsl_output.txt
Thanks!

VinnitskiV commented on September 16, 2024

@zj88 Thank you,
Could you run with:
export I_MPI_FABRICS=tcp
or:
export I_MPI_FABRICS=ofa
If it doesn't help (no "PASSED" lines in the output), please send me the debug output with MLSL_NUM_SERVERS=3.

zj88 commented on September 16, 2024

export I_MPI_FABRICS=tcp helped.
Works great now, thanks a lot!
