Comments (25)
Hi Zbigniew,
Could you please check if you observe the same issue when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" variable instead of using pbs/rsh?
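Spelled out as a job-script fragment, the suggestion is just one export before your usual `mpirun` line (a minimal sketch, nothing MLSL-specific assumed):

```shell
# Force the Hydra process manager to bootstrap over plain ssh instead of
# the PBS rsh wrapper (pbs_tmrsh) that the job was picking up by default.
export I_MPI_HYDRA_BOOTSTRAP=ssh
echo "bootstrap: $I_MPI_HYDRA_BOOTSTRAP"
```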
Thanks,
Mikhail
from mlsl.
Hi Mikhail,
Thanks for your reply. When I set "I_MPI_HYDRA_BOOTSTRAP=ssh" I get:
usage: /opt/pbs/default/bin/pbs_tmrsh [-n][-l username] host [-n][-l username] command
/opt/pbs/default/bin/pbs_tmrsh --version
=>> PBS: job killed: walltime 78 exceeded limit 60
[mpiexec@r1i4n32] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@r1i4n32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@r1i4n32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@r1i4n32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r1i4n32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r1i4n32] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
I can spot some pbs_tmrsh commands in the output:
Proxy launch args: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id
[mpiexec@r1i4n32] Launch arguments: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 0
[mpiexec@r1i4n32] Launch arguments: /opt/pbs/default/bin/pbs_tmrsh -x -q r1i4n33 <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 1
Thanks Zbigniew. Would you please collect the output with the I_MPI_HYDRA_DEBUG=1 environment variable set and your initial launch approach (I_MPI_HYDRA_BOOTSTRAP=rsh)? Please note that this debug output may contain all environment variables set in your session, so if you have anything private there, please remove it.
I replaced some info with variables <...>. Hope it's readable.
Thanks - it does help! Could you please try setting "MLSL_HOSTNAME_TYPE=1" in addition to what you already have in your env? If this doesn't help, please recollect the debug output with I_MPI_HYDRA_DEBUG=1 and MLSL_HOSTNAME_TYPE=1.
Sure, here you are:
mlsl_debug.txt
At least we've got a different error now instead of a timeout:
Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x73f5c0)
PMPI_Ibcast(989).: Invalid communicator
Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x74a380)
PMPI_Ibcast(989).: Invalid communicator
[proxy:1:1@r1i4n33] HYD_pmcd_pmip_control_cmd_cb (../../pm/pmiserv/pmip_cb.c:3481): assert (!closed) failed
[proxy:1:1@r1i4n33] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:1:1@r1i4n33] main (../../pm/pmiserv/pmip.c:558): demux engine error waiting for event
By the way, I had seen this error before, when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" together with "I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh" (MLSL_HOSTNAME_TYPE unset).
Actually, setting MLSL_HOSTNAME_TYPE=2 seems to work! No error is returned. Could you please check whether this output is correct:
mlsl_debug.txt
Hi Zbigniew,
Unfortunately, MLSL_HOSTNAME_TYPE=2 doesn't work without the MLSL_IFACE_IDX or MLSL_IFACE_NAME env variable also set.
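For anyone hitting this later, the pairing described above would look roughly like this; the interface name `ib0` is a placeholder, not something this thread confirms for this cluster:

```shell
# Assumed usage: MLSL_HOSTNAME_TYPE=2 derives the node name from a network
# interface, so it only works together with an interface selector.
export MLSL_HOSTNAME_TYPE=2
export MLSL_IFACE_NAME=ib0   # placeholder; use your node's actual interface
```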
Could you please collect the full output (out and err; if you have anything private there, please remove it) with I_MPI_HYDRA_DEBUG=1, MLSL_LOG_LEVEL=5 and MLSL_HOSTNAME_TYPE=1? Also, add "-l" to your mpirun command line, like this:
- mpirun -n 2 -ppn 1 -l -hosts ***.
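As a job-script fragment, the requested collection might look like this (the host list stays elided as in the original, and `./mlsl_test 1` is assumed from later in the thread; a sketch, not verified on this system):

```shell
# Per-rank prefixes (-l) plus full Hydra and MLSL debug output,
# with stdout and stderr captured separately.
export I_MPI_HYDRA_DEBUG=1
export MLSL_LOG_LEVEL=5
export MLSL_HOSTNAME_TYPE=1
mpirun -n 2 -ppn 1 -l -hosts *** ./mlsl_test 1 > mlsl.out 2> mlsl.err
```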
--
BR,
Vladimir
out:
mlsl_debug.txt
err:
[1] Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
[1] PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x73f680)
[1] PMPI_Ibcast(989).: Invalid communicator
[0] Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
[0] PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x74a3c0)
[0] PMPI_Ibcast(989).: Invalid communicator
[proxy:1:1@r1i0n3] HYD_pmcd_pmip_control_cmd_cb (../../pm/pmiserv/pmip_cb.c:3481): assert (!closed) failed
[proxy:1:1@r1i0n3] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:1:1@r1i0n3] main (../../pm/pmiserv/pmip.c:558): demux engine error waiting for event
Thanks, Zbigniew,
Could you please try to reproduce this issue with the latest Intel MLSL version, Intel(R) MLSL 2018 Update 1 Preview?
--
BR,
Vladimir
Hi Vladimir,
It seems to be working well with the latest mlsl, but please take a look:
mlsl_debug.txt
Error file is empty, mlsl used is mlsl_2018.1.005.
Yes, it's working correctly.
So for future runs you should set MLSL_HOSTNAME_TYPE=1.
If you don't have any other questions, we will close this issue.
--
BR,
Vladimir
Yes, you can close it. Thanks Mikhail and Vladimir for your help!
Thanks Zbigniew. Please let us know in case you face any further issues.
Hi again,
Unfortunately, I am having a similar issue while running on 16 nodes, this time when setting MLSL_NUM_SERVERS > 3. This config runs correctly:
#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=61
export MLSL_NUM_SERVERS=3
export MLSL_SERVER_AFFINITY="61,62,63"
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1
while this one times out:
#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=60
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY="60,61,62,63"
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1
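As a side note, the affinity lists in both configs above are just the last MLSL_NUM_SERVERS cores of a 64-core node; a throwaway sketch of that arithmetic (the variable names here are illustrative shell variables, not anything MLSL reads itself):

```shell
ncpus=64
num_servers=4
first=$((ncpus - num_servers))               # first core handed to the servers
affinity=$(seq -s, "$first" $((ncpus - 1)))  # comma-joined core list
echo "MLSL_SERVER_AFFINITY=$affinity"        # MLSL_SERVER_AFFINITY=60,61,62,63
```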
Do you have any idea what could be wrong here? I'm attaching the output files:
mlsl_debug-success.txt
mlsl_debug-fail.txt
The mlsl version used is mlsl_2018.1.005.
Also, running on 8 nodes with 4 mlsl servers succeeds.
@zj88
Hello, could you please collect results with a debug build:
make clean && make ENABLE_DEBUG=1 && make install
Thanks for the answer. I have built mlsl in debug mode; here is the output:
mlsl_output.txt
I had this config:
#PBS -l select=16:ncpus=64:mpiprocs=1
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
export OMP_NUM_THREADS=60
export KMP_HW_SUBSET=1t
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY=6,7,8,9
export KMP_AFFINITY="proclist=[0-5,10-63],granularity=thread,explicit"
@zj88
Thank you, it looks like the problem is a walltime that is too small.
Could you try using 5 minutes?
Sorry for the delay, the machine was busy.
Unfortunately, the output is almost the same with a 5-minute walltime:
mlsl_output.txt
I'm not sure what could be wrong here.
@zj88
Could you run with MLSL_SERVER_CREATION_TYPE=1?
Similar result:
mlsl_output.txt
I also tried with 10 min, but it ends with the same error code and almost the same output.
@zj88 Thank you,
Can you use ssh instead of rsh? Like this:
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=
Please take a look:
mlsl_output.txt
Thanks!
@zj88 Thank you,
Could you run with:
export I_MPI_FABRICS=tcp
or export I_MPI_FABRICS=ofa
If that does not help (no "PASSED" lines in the output), please send me the debug output with MLSL_NUM_SERVERS=3.
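A hedged sketch of trying the two suggested fabrics in turn, using the same launch line as earlier in this thread and the "PASSED" lines as the success signal (assumptions, not a verified recipe for this cluster):

```shell
# Try each fabric until the test reports PASSED in its output.
for fabric in tcp ofa; do
  export I_MPI_FABRICS=$fabric
  mpirun -n 16 -ppn 1 -l -f "$HOSTFILE" ./mlsl_test 1 > "out.$fabric" 2>&1
  grep -q PASSED "out.$fabric" && { echo "works with $fabric"; break; }
done
```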
export I_MPI_FABRICS=tcp
helped.
Works great now, thanks a lot!