hpc / spindle
Scalable dynamic library and python loading in HPC environments
License: Other
A new page on the LLNL software portal (https://software.llnl.gov/radiuss/) will soon dynamically pull in RADIUSS repos. To achieve this, we need RADIUSS-related repos outside the LLNL org to be tagged with relevant topics. See LLNL/llnl.github.io#17, LLNL/llnl.github.io#151, & https://github.com/LLNL/llnl.github.io/blob/new-home-page/radiuss/README.md for additional context.
For Spindle, please add the topics performance and radiuss.
Also, you may want to update the computation.llnl.gov in README to https://computation.llnl.gov/projects/spindle/.
I'm trying to install spindle and the make is failing with:
/bin/sh: not-found: command not found
make[2]: *** [libfuncdict.so] Error 127
make[2]: Leaving directory `/spindle/testsuite'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/spindle'
make: *** [all] Error 2
Is this something provided by an mpi library? I hadn't installed one in the container yet (is this required for spindle, or does it help with other kinds of loads outside of MPI?)
This feature is currently available by running with --debug=yes. It also leaves the first page of the global file mapped while the remaining pages are mapped to the local file. During testing, determine if the entire local file can be used.
Also, determine whether core dumping will work as expected with the text and data remapped in this way.
CentOS 7.6 on Intel Westmere
After building and installing LaunchMON and then trying to build Spindle from the current sources via git clone from their current GitHub source locations, I encounter the following problem in the "make" step. It appears to be having trouble finding and using libfuncdict.so in the testsuite area:
Making all in testsuite
make[2]: Entering directory `/root/spindle-test/Spindle/testsuite'
CCLD libfuncdict.so
/bin/sh: not-found: command not found
make[2]: *** [libfuncdict.so] Error 127
make[2]: Leaving directory `/root/spindle-test/Spindle/testsuite'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/spindle-test/Spindle'
make: *** [all] Error 2
Separately installing via spack, it seems to hang during the build and if I look for recently modified files, I see the following:
# find /tmp/root/spack-stage/spack-stage-spindle-0.8.1-udgnp63afjbx3uj5nz2k5qo7237kx442/ -mmin -10
/tmp/root/spack-stage/spack-stage-spindle-0.8.1-udgnp63afjbx3uj5nz2k5qo7237kx442/spack-src/testsuite
/tmp/root/spack-stage/spack-stage-spindle-0.8.1-udgnp63afjbx3uj5nz2k5qo7237kx442/spack-src/testsuite/libtest4000.so
Checking the shared library that is open, it also appears to be missing libfuncdict.so:
# ldd /tmp/root/spack-stage/spack-stage-spindle-0.8.1-udgnp63afjbx3uj5nz2k5qo7237kx442/spack-src/testsuite/libtest4000.so
linux-vdso.so.1 => (0x00007ffd0472d000)
libfuncdict.so => not found
libc.so.6 => /lib64/libc.so.6 (0x00007f993a6e5000)
/lib64/ld-linux-x86-64.so.2 (0x00007f993afb3000)
/scratch/pmpi/dsolt/WORKSPACE/spindle/Spindle/src/fe/hostbin/launch_hostbin.cc:297: undefined reference to `pthread_join'
../hostbin/.libs/libhostbin.a(libhostbin_la-launch_hostbin.o): In function `IOThread::~IOThread()':
/scratch/pmpi/dsolt/WORKSPACE/spindle/Spindle/src/fe/hostbin/launch_hostbin.cc:297: undefined reference to `pthread_join'
collect2: error: ld returned 1 exit status
I have a question about using SPINDLE with the OpenMPI launcher.
Is there any way to use SPINDLE with the OpenMPI launcher without MPIR?
For example, can SPINDLE run with PMIx instead of MPIR?
If there is currently no way, is there any plan to support PMIx?
I'm getting errors in testing and attempted usage that Spindle cannot connect to some session. I'm installing as follows:
./configure --with-munge-dir=/etc/munge --enable-sec-munge --with-slurm-dir=/etc/slurm --with-testrm=slurm
make
make install
I've tried that with both slurm and openmpi as the "testrm". Then I build and run the tests:
cd testsuite
make
./runTests
but no matter what I do (using the slurm or openmpi template, both of which I have) I see this error:
Running: ./run_driver --partial --session
ERROR: Spindle could not connect to session tn2VYQ
I saw this same error in trying to just use spindle so I've gone back to the tests to debug. Note that I do have a /tmp area:
ls /tmp/
ccFjQGLR.s ks-script-eC059Y spin.kT6PPu spin.tn2VYQ spin.Un7RTL yum.log
Update: I think it could possibly be that they need to see the same /tmp area - so I'm rebuilding the containers with a shared /tmp area and will report back.
Support for ppc64le (Power8 and later) processors is missing. The ppc64 and x86_64 architectures are currently supported.
Are there any plans on adding support for this architecture in the future?
It has been noticed that python will sometimes perform fstat() operations on local .py files while performing stat() operations on global .pyc files, which may yield unexpected results when comparing the modification times of .py and .pyc files.
When I tried SPINDLE, I found that $ORIGIN in the RPATH of a nested dependency is not handled correctly, so the process cannot load some libraries.
Example: when a Python script in my environment imports matplotlib:
/path/to/lib/python2.7/site-packages/numpy/core/multiarray.so
/tmp/spindle.PIDNUM/b0-_path_to_lib_python2.7_site-packages_numpy_core_multiarray.so
$ORIGIN/../.libs/tls/x86_64/libopenblasp.so
In this case, $ORIGIN/../.libs/tls/x86_64/libopenblasp.so should be expanded to /path/to/lib/python2.7/site-packages/numpy/core/../.libs/libopenblasp.so.
However, SPINDLE expands it to /tmp/spindle.PIDNUM/../.libs/tls/x86_64/libopenblasp.so.
That is, SPINDLE expands $ORIGIN as /tmp/spindle.PIDNUM/ instead of /path/to/lib/python2.7/site-packages/numpy/core/.
As a result, the process cannot load multiarray.so.
This issue may be similar to #17, but current SPINDLE runs with --debug=yes by default.
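The expansion the reporter expects can be sketched as follows. This is only an illustration of the intended behavior, not Spindle's code: the helper names dir_of and expand_origin are hypothetical, and the sketch assumes the loader still knows the library's original (global) path when it resolves the RPATH entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: directory part of a path ("." if there is no slash).
inline std::string dir_of(const std::string& path) {
    std::size_t pos = path.rfind('/');
    if (pos == std::string::npos) return ".";
    if (pos == 0) return "/";
    return path.substr(0, pos);
}

// Hypothetical helper: replaces $ORIGIN and ${ORIGIN} in one RPATH entry with
// the directory of the library's ORIGINAL location, not the directory of the
// cached copy under /tmp/spindle.PIDNUM.
inline std::string expand_origin(const std::string& rpath_entry,
                                 const std::string& original_lib_path) {
    assert(!original_lib_path.empty());
    const std::string origin = dir_of(original_lib_path);
    std::string out;
    std::size_t i = 0;
    while (i < rpath_entry.size()) {
        if (rpath_entry.compare(i, 9, "${ORIGIN}") == 0) {
            out += origin;
            i += 9;
        } else if (rpath_entry.compare(i, 7, "$ORIGIN") == 0) {
            out += origin;
            i += 7;
        } else {
            out += rpath_entry[i++];
        }
    }
    return out;
}
```

Called with the entry $ORIGIN/../.libs and the original path of multiarray.so, the helper yields a directory under numpy/core rather than under the local cache directory.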
A plain spack installation or manual installation without an active MPI variant defined throws an error about a missing mpi.h in the testsuite, as below:
352 make[3]: Entering directory '/tmp/asill/spack-stage/spack-stage-spindle-0.8.1-av65uymhbjk5xlot4r7o7zrdplcrathu/spack-src/testsuite'
353 CC test_driver-test_driver.o
354 CC test_driver_libs-test_driver.o
>> 355 test_driver.c:17:10: fatal error: mpi.h: No such file or directory
356 #include <mpi.h>
357 ^~~~~~~
358 compilation terminated.
>> 359 test_driver.c:17:10: fatal error: mpi.h: No such file or directory
360 #include <mpi.h>
361 ^~~~~~~
362 compilation terminated.
363 make[3]: *** [Makefile:340: test_driver-test_driver.o] Error 1
364 make[3]: *** Waiting for unfinished jobs....
365 make[3]: *** [Makefile:356: test_driver_libs-test_driver.o] Error 1
Does Spindle require a specific MPI package to be set up to address the missing mpi.h, and if so, is a separate Spindle instance required for each MPI variant to be used? We have many MPI variants in use, of course, so the latter would definitely be a hassle, but I suspect I am missing something obvious here.
Running an application executable built with the BIND_NOW option under Spindle causes a segmentation fault. I saw the fault on an x86 cluster and an aarch64 cluster.
I confirmed the following reproduction steps on the x86 cluster.
The linker version on the x86 cluster:
$ LC_ALL=C ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
I downloaded v0.12 from https://github.com/hpc/Spindle/releases/tag/v0.12 and built it.
Prepare a simple application built with BIND_NOW and run it with Spindle as follows.
$ cat hello.c
#include <stdio.h>
int main (int argc, char* argv[])
{
printf ("Hello world!\n");
return 0;
}
$ gcc -Wl,-z,now -o hello_bind_now hello.c
$ SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --location='/tmp' mpiexec -np 1 spindlemarker $(pwd)/hello_bind_now
<Aug 31 16:19:45> <Launchmon> (INFO): The RM process has just been forked and exec'ed.
<Aug 31 16:19:45> <Launchmon> (INFO): Just continued the RM process out of the first trap
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 247311 RUNNING AT 10.xx.yy.zz
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
Without the BIND_NOW option, the application runs with Spindle.
$ gcc -o hello hello.c
$ SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --location='/tmp' mpiexec -np 1 spindlemarker $(pwd)/hello
<Aug 31 16:20:26> <Launchmon> (INFO): The RM process has just been forked and exec'ed.
<Aug 31 16:20:26> <Launchmon> (INFO): Just continued the RM process out of the first trap
Hello world!
In the debug output, the SPINDLE client appears to stop after the following log lines.
[Client.0.252100@auditclient_common.c:92] la_objopen - la_objopen(): loading /lib64/libc.so.6, link_map = 0x2b60c23859c8, lmid = LM_ID_BASE, cookie = 0x2b60c2385e30
[Client.0.252100@auditclient_common.c:116] la_activity - la_activity(): cookie = 0x2b60c25685c0; flag = LA_ACT_CONSISTENT
[[email protected]:30] remove_lib_rogot - Checking whether /lib64/libc.so.6 has R GOT
[[email protected]:41] remove_lib_rogot - Changing /lib64/libc.so.6 R GOT to RW GOT from 2b60c2b40000 to 2b60c2b44000
[[email protected]:30] remove_lib_rogot - Checking whether /lib64/ld-linux-x86-64.so.2 has R GOT
[[email protected]:41] remove_lib_rogot - Changing /lib64/ld-linux-x86-64.so.2 R GOT to RW GOT from 2b60c2566000 to 2b60c2567000
[[email protected]:39] spindle_la_activity - la_activity(): cookie = 0x2b60c25685c0; flag = LA_ACT_CONSISTENT
[Server.252113@ldcs_api_listen.c:174] ldcs_listen - Select returned data. Calling callback for fd 14 id=0
[Server.252113@ldcs_audit_server_client_cb.c:61] _ldcs_client_CB - Receiving message from client 0 on fd 14
[Server.252113@ldcs_api_pipe.c:387] _ldcs_read_pipe - before read from fifo 14, bytes_to_read = 8
[Server.252113@ldcs_api_pipe.c:398] _ldcs_read_pipe - read from fifo: 0 bytes ...
[Server.252113@ldcs_api_pipe.c:338] ldcs_recv_msg_static_pipe - Client disconnected. Returning END message
The output of readelf -d for each application binary:
$ LC_ALL=C readelf -d hello_bind_now
Dynamic section at offset 0xdd8 contains 26 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x4003e0
0x000000000000000d (FINI) 0x4005c4
0x0000000000000019 (INIT_ARRAY) 0x600dc0
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x600dc8
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x400298
0x0000000000000005 (STRTAB) 0x400318
0x0000000000000006 (SYMTAB) 0x4002b8
0x000000000000000a (STRSZ) 61 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0x600fc8
0x0000000000000002 (PLTRELSZ) 72 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x400398
0x0000000000000007 (RELA) 0x400380
0x0000000000000008 (RELASZ) 24 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x0000000000000018 (BIND_NOW)
0x000000006ffffffb (FLAGS_1) Flags: NOW
0x000000006ffffffe (VERNEED) 0x400360
0x000000006fffffff (VERNEEDNUM) 1
0x000000006ffffff0 (VERSYM) 0x400356
0x0000000000000000 (NULL) 0x0
$
$ LC_ALL=C readelf -d hello
Dynamic section at offset 0xe28 contains 24 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x4003e0
0x000000000000000d (FINI) 0x4005c4
0x0000000000000019 (INIT_ARRAY) 0x600e10
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x600e18
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x400298
0x0000000000000005 (STRTAB) 0x400318
0x0000000000000006 (SYMTAB) 0x4002b8
0x000000000000000a (STRSZ) 61 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0x601000
0x0000000000000002 (PLTRELSZ) 72 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x400398
0x0000000000000007 (RELA) 0x400380
0x0000000000000008 (RELASZ) 24 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffe (VERNEED) 0x400360
0x000000006fffffff (VERNEEDNUM) 1
0x000000006ffffff0 (VERSYM) 0x400356
0x0000000000000000 (NULL) 0x0
$
The Spack package needs to be updated to v0.13. Adding spindle@0.13 to a Spack environment and running spack install gives the warning:
==> Warning: There is no checksum on file to fetch spindle@0.13 safely.
The currently available spack installation for Spindle pulls release 0.8.1 and fails due to a narrowing error when compiling spindle_logd.cc in the logging directory:
133 CXX spindle_logd.o
>> 134 spindle_logd.cc:65:76: error: narrowing conversion of '255' from 'int' to 'char' inside { } [-Wnarrowing]
135 static char exitcode[8] = { 0x01, 0xff, 0x03, 0xdf, 0x05, 0xbf, 0x07, '\n' };
136 ^
>> 137 spindle_logd.cc:65:76: error: narrowing conversion of '223' from 'int' to 'char' inside { } [-Wnarrowing]
>> 138 spindle_logd.cc:65:76: error: narrowing conversion of '191' from 'int' to 'char' inside { } [-Wnarrowing]
139 CCLD libspindlelogc.la
140 make[2]: *** [Makefile:386: spindle_logd.o] Error 1
141 make[2]: Leaving directory '/tmp/asill/spack-stage/spack-stage-spindle-0.8.1-u6g66hhvbkxfa7n32x2gzferzpurspf3/spack-src/logging'
142 make[1]: *** [Makefile:319: all-recursive] Error 1
143 make[1]: Leaving directory '/tmp/asill/spack-stage/spack-stage-spindle-0.8.1-u6g66hhvbkxfa7n32x2gzferzpurspf3/spack-src'
144 make: *** [Makefile:248: all] Error 2
I can work around this by using ./bin/spack install spindle cxxflags="-Wno-narrowing", but the spack package should probably be updated and this fixed in the older tarball for manual installations.
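The error arises because 0xff, 0xdf and 0xbf do not fit in a signed char, which C++11 treats as a narrowing conversion inside braces. One possible source-level fix, sketched below, is to store the bytes as unsigned char. This is an assumption about how the array could be changed, not a description of what later Spindle releases do; the names exitcode_bytes and is_exit_marker are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Same byte values as the exitcode array quoted in the error output.
// With unsigned char elements, 0xff, 0xdf and 0xbf are representable,
// so the initializer compiles without -Wno-narrowing.
static const unsigned char exitcode_bytes[8] = {
    0x01, 0xff, 0x03, 0xdf, 0x05, 0xbf, 0x07, '\n'
};

// Hypothetical helper: checks whether a received buffer equals the marker.
inline bool is_exit_marker(const unsigned char* buf, std::size_t len) {
    assert(buf != nullptr);
    if (len != sizeof(exitcode_bytes)) return false;
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] != exitcode_bytes[i]) return false;
    }
    return true;
}
```

Any code that passes the array to an interface expecting char* would need a cast at that call site, so the flag-based workaround may still be the simpler option for an unmodified tarball.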
./configure --enable-sec-none --with-hostbin=/scratch/pmpi/dsolt/WORKSPACE/spindle/myscript.sh
make
make[4]: Entering directory `/scratch/pmpi/dsolt/WORKSPACE/spindle/Spindle/src/client/beboot'
CC spindle_bootstrap-spindle_bootstrap.o
CC spindle_bootstrap-parseloc.o
CC spindle_bootstrap-spindle_mkdir.o
make[4]: *** No rule to make target `../auditclient/exec_util.c', needed by `spindle_bootstrap-exec_util.o'. Stop.
make[4]: Leaving directory `/scratch/pmpi/dsolt/WORKSPACE/spindle/Spindle/src/client/beboot'
The license files in Spindle have the incorrect street address for the Free Software Foundation:
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
This should be updated to the new address:
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
https://fedoraproject.org/wiki/Common_Rpmlint_issues#incorrect-fsf-address
This is discussed some in pull request #9.
I discovered yesterday that when I used prefix=/foo/bar/linux-rhel7 for configure, the C preprocessor was expanding that to "/foo/bar/1-rhel7". This is because "linux" was itself a macro which expanded to "1" in the code when $SRC/src/logging/spindle_logc.c was pre-processed.
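The mechanism can be reproduced in a few lines. The sketch below assumes the prefix reaches the source as an unquoted macro that is later stringified; the macro names DEMO_PREFIX, DEMO_PREFIX_QUOTED, STR and STR_INNER are hypothetical and are not taken from spindle_logc.c.

```cpp
#include <cassert>
#include <cstring>

// GNU-mode compilers predefine `linux` as 1 on Linux. Define it here if it is
// missing so the sketch behaves the same under strict -std=c++NN modes.
#ifndef linux
#define linux 1
#endif

#define STR_INNER(x) #x
#define STR(x) STR_INNER(x)  // macro-expands x first, then stringifies it

// Stand-in for an unquoted -DDEMO_PREFIX=/foo/bar/linux-rhel7 on the
// compiler command line.
#define DEMO_PREFIX /foo/bar/linux-rhel7

// Stand-in for passing the value already quoted; string literals are never
// macro-expanded, so `linux` survives intact.
#define DEMO_PREFIX_QUOTED "/foo/bar/linux-rhel7"

inline const char* expanded_prefix() { return STR(DEMO_PREFIX); }
inline const char* quoted_prefix() { return DEMO_PREFIX_QUOTED; }
```

Here expanded_prefix() returns the mangled path with 1 in place of linux, while quoted_prefix() returns the path unchanged. Quoting the value on the command line or undefining linux before the stringification are two possible ways to avoid the problem.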
Both of the options in the above title change Spindle to invoke the application executable from global storage rather than local. The execv in the bootstrapper can fail when pointed at a global executable given by a relative path. This doesn't happen for local executables since we construct the path and make sure it's absolute.
The execv in spindle_bootstrap should thus be changed to an execvp.
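The difference between the two calls can be sketched with a small fork-and-exec helper. This is not the code in spindle_bootstrap; run_with_path_search is a hypothetical name, and the sketch only shows that execvp searches PATH when the program name contains no slash, whereas execv would treat the same name as a path relative to the working directory.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs args in a child process and returns its exit
// status, or -1 if the child could not be created or did not exit normally.
inline int run_with_path_search(const std::vector<std::string>& args) {
    assert(!args.empty());
    std::vector<char*> argv;
    for (const std::string& a : args) {
        argv.push_back(const_cast<char*>(a.c_str()));
    }
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // execvp looks the name up in PATH when it contains no slash.
        execvp(argv[0], argv.data());
        _exit(127);  // reached only if the exec failed
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A real bootstrapper would call execvp directly rather than forking; the fork here only keeps the calling process alive so the result can be inspected.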