globalarrays / ga
Partitioned Global Address Space (PGAS) library for distributed arrays
Home Page: http://hpc.pnl.gov/globalarrays/
License: Other
https://github.com/nemequ/icc-travis shows how to install Intel compilers in Travis, although it is a bit brittle due to license issues.
I've done this before (https://github.com/ParRes/Kernels/blob/master/travis/install-intel.sh) and will do it again here at some point.
There are some cases where the assert() macro is used incorrectly, containing code that must always execute. For example, 0122f97 fixed such an issue. We need to grep through the entire code base to make sure we don't do this elsewhere.
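A minimal sketch of the pattern to grep for, and the usual fix. The helper acquire_slot() and both reserve_* functions are hypothetical names made up for illustration; they are not taken from the GA sources. The point is only that anything inside assert() vanishes when NDEBUG is defined, so the call has to be hoisted out.

```c
#include <assert.h>

/* Hypothetical stand-in for any call whose side effect must always happen. */
static int acquire_slot(int *slots_in_use)
{
    *slots_in_use += 1;   /* the side effect we must not lose */
    return 0;             /* 0 means success */
}

/* Wrong: with -DNDEBUG the whole expression, including the call, disappears. */
int reserve_wrong(int *slots_in_use)
{
    assert(acquire_slot(slots_in_use) == 0);
    return *slots_in_use;
}

/* Right: the call always runs; only the check is compiled out under NDEBUG. */
int reserve_right(int *slots_in_use)
{
    int rc = acquire_slot(slots_in_use);
    assert(rc == 0);
    (void)rc; /* avoids an unused-variable warning when NDEBUG is defined */
    return *slots_in_use;
}
```

A search for assert( lines that contain a function call or an assignment is a reasonable first pass, although it will produce false positives for calls that have no side effects.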
Trying to build in fedora rawhide --with-ofi:
src-ofi/comex.c: In function 'getputs_stride':
src-ofi/comex.c:2212:63: error: dereferencing pointer to incomplete type 'const struct iovec'
msg->msg_iov = malloc(ofi_data.rma_iov_limit * sizeof(*msg->msg_iov));
^~~~~~~~~~~~~
Full log at https://kojipkgs.fedoraproject.org//work/tasks/4498/18404498/build.log
I have observed this issue on Cray KNLs when the module craype-mic-knl is loaded. craype-mic-knl triggers the -xmic-avx512 icc option that seems to generate erroneous code out of src-mpi-pr/comex.c
The problem can be reproduced with any NWChem run. I have not tried comex/ga tests, which one should I try?
rsg.F and ga_diag_seq.F use #ifdef ENABLE_EISPACK, and since config.h unconditionally defines ENABLE_EISPACK to either 0 or 1, we're stuck using the old EISPACK code. Update these two source files to use #if ENABLE_EISPACK instead.
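To illustrate the difference, here is a small C sketch. It assumes config.h has defined ENABLE_EISPACK to 0; the two function names are invented for the example. The preprocessor semantics are the same when cpp is run over the Fortran .F files.

```c
/* Assumption: config.h always defines the macro, here to 0 (feature off). */
#define ENABLE_EISPACK 0

/* #ifdef only asks "is it defined?", so this returns 1 even though the value is 0. */
int selected_by_ifdef(void)
{
#ifdef ENABLE_EISPACK
    return 1;
#else
    return 0;
#endif
}

/* #if tests the value, so this returns 0 when ENABLE_EISPACK is 0. */
int selected_by_if(void)
{
#if ENABLE_EISPACK
    return 1;
#else
    return 0;
#endif
}
```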
The autogen.sh fails to build the right set of autotools when only automake needs to be built, while m4, libtool and autoconf are already at the required version. I think that the script fails since the automake build has a hardwired reference to ../autotools/share/aclocal that should come from libtool, but libtool has not been built. A simple solution would be to build libtool even though it has the required version already.
Here are some snippets from the attached log file
+ M4_VERSION=1.4.17
+ LIBTOOL_VERSION=2.4.6
+ AUTOCONF_VERSION=2.69
../ga-5-6-1/autotools/share/aclocal': No such file or directory
abstract_ops.h has two static variables shared between operators.
static double __elem_op_var;
static double __elem_op_var2;
This is beside the fact abstract_ops.h is difficult to use. It's trying to solve the problem of copy-paste coding for the various GA data types in math operations, e.g., scan_add, elem_multiply.
GA and ComEx correctness and performance tests should be organized for easier navigation.
We have received a patch for GAMESS. We need to review it and add it to the github.
There are a few places, e.g. global/src/DP.c, where K&R style function declarations are still in use.
For example:
// K&R syntax
int foo(a, p)
int a;
char *p;
{
return 0;
}
// ANSI syntax
int foo(int a, char *p)
{
return 0;
}
Clean these up. Consider using gcc warnings to locate them, but it might not be bulletproof: -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes or -ansi -pedantic.
https://github.com/nemequ/pgi-travis demonstrates how to install PGI in Travis. Testing with PGI is useful because (1) NWChem/GA users use it and (2) GNU Fortran tolerates code that does not conform to ISO Fortran and will break with other Fortran compilers (e.g. Sandia-OpenSHMEM/SOS#371).
m4 version 1.4.13 on cascade does not work.
Modified travis/install-autotools.sh to get 1.4.16:
M4_VERSION_MIN=1.4.16
I could not find any way to build the MA Fortran wrappers in the current CMake infrastructure.
For example, libga.a contains the symbol MA_push_stack, but not MA_push_stack_
When trying to build the GA-5.6 release tarball, autoconf 2.69, automake 1.11 and libtool 2.4.6 are downloaded and built even when these versions already exist.
Hi,
With the new (5.6) GA there seems to be a small glitch when switching off Fortran, e.g.:
./configure --prefix=${HOME}/ga-install --disable-f77 MPICC=mpicc MPICXX=mpicxx
worked with 5.5 but with 5.6 this gives:
configure: error: conditional "F77_INTEL_NO_INLINE" was never defined.
Usually this means the macro was only invoked conditionally.
The F77_INTEL_NO_INLINE test seems to be new to 5.6, perhaps it just needs a default value for --disable-f77 case?
Best wishes,
Andy
Hi Bruce:
Can you please update the following page? Or send me (Abhinav) the documentation, and I will add it to the wiki.
The gather/scatter/scatteracc code has undergone a few revisions over the project lifetime. The latest addition of an alloc function is not thread safe. These routines need additional review and testing in light of certain optimizations made to ARMCI contiguous/strided checks that caused gatscat code to fail when it shouldn't.
Most of the uses of the reduction operator * are doing "logical and" on 0 and 1. We should instead use something that maps down to MPI_LAND.
I'm going to add some new collective ops and map at least the GA internal collectives to them where appropriate.
cca/ga_cca_classic/overload.cxx: GA_Lgop(&isEqual, 1, (char *)"*");
ga++/src/overload.cc: GA_Lgop(&isEqual, 1, (char *)"*");
global/src/base.c: pnga_pgroup_gop(p_handle,pnga_type_f2c(MT_F_INT), &status, 1, "*");
global/src/base.c: pnga_gop(pnga_type_f2c(MT_F_INT), &status, 1, "*");
global/src/base.c: pnga_gop(pnga_type_f2c(MT_F_INT), &status, 1, "*");
global/src/base.c: pnga_pgroup_gop(grp_id, pnga_type_f2c(MT_F_INT), &status, 1, "*");
global/src/base.c: pnga_gop(pnga_type_f2c(MT_F_INT), &status, 1, "*");
global/src/elem_alg.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/elem_alg.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/elem_alg.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/elem_alg.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/global.npatch.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/global.npatch.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible_a, 1, "*");
global/src/global.npatch.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible_b, 1, "*");
global/src/matrix.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/matrix.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/src/matrix.c: pnga_gop(pnga_type_f2c(MT_F_INT), &compatible, 1, "*");
global/testing/unit-tests/ga_dgop.c: GA_Dgop(x, n,"*");
global/testing/unit-tests/ga_lgop.c: GA_Lgop(x, n, "*");
gparrays/testing/testc.c: GA_Igop(&idx,1,"*");
gparrays/testing/testc.c: GA_Igop(&idx,1,"*");
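For reference, a local C sketch (no MPI involved) of why the two reductions are interchangeable for 0/1 flags but not in general. Both function names are hypothetical; combine_with_land only mimics the semantics of MPI_LAND on a local array and is not how a GA collective would be implemented.

```c
#include <stddef.h>

/* Combine flags the way a "*" reduction does: multiply everything together. */
long combine_with_product(const long *flags, size_t n)
{
    long result = 1;
    for (size_t i = 0; i < n; ++i) {
        result *= flags[i];
    }
    return result;
}

/* Combine flags the way a logical-AND reduction does: nonzero means true. */
long combine_with_land(const long *flags, size_t n)
{
    long result = 1;
    for (size_t i = 0; i < n; ++i) {
        result = (result && flags[i]) ? 1 : 0;
    }
    return result;
}
```

For inputs restricted to 0 and 1 the two agree. Once a flag holds any other nonzero value, the product stops being a clean boolean (and can eventually overflow), while the logical AND still yields 0 or 1.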
When using NGA_Put64 there are problems if the values are over the 32-bit limit, i.e. they cannot be represented by int; e.g. see the seg fault below. Even though _my_memcpy takes a size_t n, it is passed int bytes, so by that point at least we have lost the 64-bit information. Everything works fine for a non-ComEx port, e.g. --with-sockets.
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff54fb254 in __memcpy_sse2_unaligned () from /lib64/libc.so.6
(gdb) where
#0 0x00007ffff54fb254 in __memcpy_sse2_unaligned () from /lib64/libc.so.6
#1 0x000000000395f222 in _my_memcpy (dest=0x7ffdb6c6f010, src=0x7ffed4e23010, n=18446744071562067968) at src-mpi/comex.c:139
#2 0x000000000395ff72 in _put_nbi (src=0x7ffed4e23010, dst=0x7ffdb6c6f010, bytes=-2147483648, proc=0) at src-mpi/comex.c:644
#3 0x000000000395fe48 in comex_put (src=0x7ffed4e23010, dst=0x7ffdb6c6f010, bytes=-2147483648, proc=0, group=0) at src-mpi/comex.c:613
#4 0x000000000395b94e in PARMCI_PutS (src_ptr=0x7ffed4e23010, src_stride_arr=0x7fffffffc824, dst_ptr=0x7ffdb6c6f010, dst_stride_arr=0x7fffffffc844, count=0x7fffffffc804, stride_levels=0, proc=0) at src-armci/armci.c:484
#5 0x000000000395e889 in ARMCI_PutS (src_ptr=0x7ffed4e23010, src_stride_arr=0x7fffffffc824, dst_ptr=0x7ffdb6c6f010, dst_stride_arr=0x7fffffffc844, count=0x7fffffffc804, stride_levels=0, proc=0) at src-armci/capi.c:377
#6 0x00000000038d8a76 in ngai_puts (loc_base_ptr=0x7ffed4e23010 "", pbuf=0x7ffed4e23010 "", stride_loc=0x7fffffffc824, prem=0x7ffdb6c6f010 "", stride_rem=0x7fffffffc844, count=0x7fffffffc804, nstrides=0, proc=0, field_off=0, field_size=-1, type_size=8)
at global/src/onesided.c:411
#7 0x00000000038db979 in ngai_put_common (g_a=-1000, lo=0x7fffffffcd90, hi=0x7fffffffcd50, buf=0x7ffed4e23010, ld=0x7fffffffcd10, field_off=0, field_size=-1, nbhandle=0x0) at global/src/onesided.c:708
#8 0x00000000038dea0e in pnga_put (g_a=-1000, lo=0x7fffffffcd90, hi=0x7fffffffcd50, buf=0x7ffed4e23010, ld=0x7fffffffcd10) at global/src/onesided.c:1265
#9 0x00000000038674df in NGA_Put64 (g_a=-1000, lo=0x30901270, hi=0x309017b0, buf=0x7ffed4e23010, ld=0x7fffffffce30) at global/src/capi.c:1657
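The numbers in frames #1 and #2 are consistent with a negative int being widened to size_t. A small sketch of that conversion and of a guard that could be checked before narrowing; both function names are made up for illustration and are not ComEx functions.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when a 64-bit byte count can be passed through an int parameter. */
int byte_count_fits_int(int64_t bytes)
{
    return bytes >= 0 && bytes <= INT_MAX;
}

/* What a size_t parameter receives when it is handed a negative int. */
size_t size_seen_by_memcpy(int bytes)
{
    return (size_t)bytes; /* well defined: wraps modulo SIZE_MAX + 1 */
}
```

With a 32-bit int and a 64-bit size_t, size_seen_by_memcpy(-2147483648) is 2^64 - 2^31 = 18446744071562067968, which is exactly the n shown in frame #1.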
More static work arrays at the top of sparse.c need to be moved into the functions that actually need them.
In Makefile.am, we see
##############################################################################
# compat
#
# Although the compat directory houses replacements for missing or erroneous
# standard C functions and such sources are conditionally compiled based on
# results from configure tests, without the "random" implementation the
# m4-generated tests always fail for scatter and copy_patch.
libga_la_SOURCES += compat/random.c
First of all, I have no idea if we even still need to use the autotool functionality of LIBOBJS. See https://www.gnu.org/software/automake/manual/html_node/LIBOBJS.html for details.
The main issue here is that random.c is unconditionally compiled. Worse yet, it appears that if we don't override the system random() and srandom() we somehow break some tests. Our compat/random.c appears to be a permissively-licensed copy of BSD's random() from 1983...?
@edoapra, would it be possible to evaluate how this replacing of system provided random() affects NWChem? I would love to remove from GA such strange hacks. A related issue is that GA provides a "drand" implementation for fortran, unconditionally. So this would replace perhaps any drand() functions provided by ifort, for example. Our drand() is a wrapper around the C random() -- the same random() we already replace...
We likely don't really need these since we can use valgrind or other utilities for checking invalid reads/writes.
The Travis CI build showed the MPI-MT port failing, but only occasionally. Upon further review of comex_fence_proc():
int comex_fence_proc(int proc, comex_group_t group)
{
#if DEBUG
printf("[%d] comex_fence_proc(proc=%d, group=%d)\n",
g_state.rank, proc, group);
#endif
comex_wait_all(COMEX_GROUP_WORLD);
return COMEX_SUCCESS;
}
The call to comex_wait_all() is effectively a no-op in this case. No fencing message is initiated, so there is nothing to wait on. This is repeated in the PT and PR implementations. I wonder if we meant to call comex_fence_all() instead? It's potentially a heavy hammer, but it would work.
In pgtest.x we see
> Checking accumulate ...
disjoint ga_acc is OK
overlapping ga_acc is OK
> Checking add ...
[3] ../../comex/src-mpi-pt/comex.c:2387: _put_packed_handler: Assertion `reg_entry' failed[3] Received an Error in Communication: (-1) comex_assert_fail
application called MPI_Abort(comm=0x84000002, -1) - process 3
In ghosts.x we see
using 2 process(es)
Value of pdims( 1 ) is 2
Value of pdims( 2 ) is 1
map( 1) = 1
map( 2) = 1001
map( 3) = 1
*
* Global array creation was successful
*
[3] ../../comex/src-mpi-pt/comex.c:2542: _get_packed_handler: Assertion `reg_entry' failed[3] Received an Error in Communication: (-1) comex_assert_fail
application called MPI_Abort(comm=0x84000002, -1) - process 3
global/src/onesided.c uses a global variable ProcListPerm to locally permute MPI ranks at the initiator of put/get/acc calls in order to avoid contention by always servicing targets in a monotonically increasing order.
Make this thread safe with appropriate malloc()/free() within the scope of the functions.
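A rough sketch of the per-call allocation idea. The function name and the rotation used as the permutation are assumptions for illustration only; the actual permutation in onesided.c may differ.

```c
#include <stdlib.h>

/*
 * Build a per-call list of target ranks instead of reusing a shared global.
 * The rotation (start at the rank after the caller) is a hypothetical choice.
 * Returns NULL on bad arguments or allocation failure; the caller must free().
 */
int *make_local_proc_list(int nproc, int my_rank)
{
    if (nproc <= 0 || my_rank < 0 || my_rank >= nproc) {
        return NULL;
    }
    int *list = malloc((size_t)nproc * sizeof *list);
    if (list == NULL) {
        return NULL;
    }
    for (int i = 0; i < nproc; ++i) {
        list[i] = (my_rank + 1 + i) % nproc;
    }
    return list;
}
```

Each put/get/acc call would allocate its own list and free it before returning, so no state is shared between threads. The cost is one malloc()/free() pair per call.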
The current implementation of GA_Sync relies on pnga_pgroup_sync. If the group is not the world group, then the call loops over all processes in the group and calls ARMCI_Fence(group_id, iproc). The current implementation of this function in the MPI3 port is to call MPI_Win_flush(proc, win) on all windows associated with the world group. This is wrong, but I don't think there is any way to implement the ARMCI_Fence operation using MPI RMA that doesn't require an order P data structure. At any rate, I think this implementation is not the way to go, since what we want to do is flush all processes in the group. Unfortunately, we don't seem to have something like ARMCI_FenceGroup. This could be easily implemented in MPI RMA but may cause problems with the other ports. Since there is a comex_fence_all function, it should be easy to implement this operation for any of the MPI based ports, but we may run into problems with the existing IB port.
I added the OFI port to the Travis CI testing. On my macbook running OSX 10.12, I was unable to build the ComEx/OFI port using clang. It can't compile the nested function. This is important to fix if you would take a look.
../../comex/src-ofi/comex.c:2594:5: error: function definition is not allowed
here
{
^
There are also a significant number of warnings due to the deprecated syscall() on OSX 10.12. Here's an example:
../../comex/src-ofi/comex.c:2480:13: warning: 'syscall' is deprecated: first
deprecated in macOS 10.12 - syscall(2) is unsupported; please switch to a
supported interface. For SYS_kdebug_trace use kdebug_signpost().
[-Wdeprecated-declarations]
COMEX_OFI_LOG(DEBUG, " %d: count = %d, stride: %d\n", i, cou...
^
I needed to add -Wno-deprecated-declarations to my CFLAGS to silence those warnings.
I committed cc4d707 to reduce some of the warnings having to do with printf() format mismatches.
In any case, the nested function will need to be addressed. Thanks.
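One common way to remove a GNU nested function is to lift it to file scope and pass the variables it used to capture through a context pointer. A generic sketch follows; none of these names come from src-ofi/comex.c.

```c
#include <stddef.h>

/* Context that the nested function used to capture implicitly. */
struct sum_ctx {
    long total;
};

/* File-scope callback: standard C, accepted by both gcc and clang. */
static void add_to_total(int value, void *user)
{
    struct sum_ctx *ctx = user;
    ctx->total += value;
}

/* Generic walker that calls back once per element. */
static void for_each(const int *values, size_t n,
                     void (*fn)(int, void *), void *user)
{
    for (size_t i = 0; i < n; ++i) {
        fn(values[i], user);
    }
}

/* Former host of the nested function: now passes its locals explicitly. */
long sum_values(const int *values, size_t n)
{
    struct sum_ctx ctx = { 0 };
    for_each(values, n, add_to_total, &ctx);
    return ctx.total;
}
```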
I could not find any way to build the GA Scalapack interface in the current CMake infrastructure
The current documentation should be improved further.
Integer *_ga_map defined in base.h is not thread safe. It is used in the following files:
It is a convenience variable for calls to pnga_locate_region so that heap memory is allocated only once during GA_Init(). Recommend locally calling malloc() as needed instead of sharing the allocation.
Must also update any macros that assume _ga_map is globally available.
We need to evaluate the thread safety of the nbutil.c file and associated routines. Non-blocking functionality is important to GA users. Currently static linked lists are used for managing non-blocking handles and shared among various static and non-static functions.
MPI_Check violates the MPI standard because it uses the reserved MPI_ namespace.
This does not break anything in practice and thus is low priority, but it is also a trivial fix.
$(BLAS_LIBS) is not added conditionally or otherwise to the LIBADD automake variable for libcomex. This should be done conditionally based on whether an external BLAS is being used. Otherwise, the MKL libraries likely won't properly load when using shared libraries.
We should create a style guide for GA programming that defines consistent naming conventions for macros and for internal functions that are not used outside of the GA library, and then try to get these naming conventions implemented in the code. We should also clean up the different memory allocators inside GA and get them on a consistent footing.
Would it be possible to add tags or branches for the previous releases available in the home page: 4.3, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5?
There are a few static function variables in matmul.c.
static short int CYCLIC_DISTR_OPT_FLAG = SET;
static short int CONTIG_CHUNKS_OPT_FLAG = SET;
static short int DIRECT_ACCESS_OPT_FLAG = SET;
NWChemEx has requested a feature in which it would be possible to execute GA_Fence, which would take a GA handle as an argument.
Feature requested by Robert Harrison
Not only are GA_PUSH_NAME, GA_POP_NAME, etc. not thread safe, they are not needed for modern debugging where we have access to stack unwinding et al. Remove all uses and associated variables.
I have a very large number of patches related to thread-safety in https://github.com/jeffhammond/ga-old/tree/thread-safe. This issue is a placeholder to track progress towards merging those.
Commit 63b5c76 for comex/src-armci/armci.c causes NWChem to generate erroneous results.
The errors show up on KNL using Intel MPI and mpirun (oddly enough, jobs started with SLURM srun don't have this issue). It's enough to use two nodes with a total of four processes.
It can be reproduced on cascade KNL nodes
One more data point to the Intel MPI behavior: if I switch from the default DAPL network fabric to the OFA network fabric, the error vanishes (this is consistent with what I discovered last week with MPI3 ...).
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_PACKET_SIZE=2048
export I_MPI_OFA_NUM_RDMA_CONNECTIONS=-1
export I_MPI_OFA_SWITCHING_TO_RDMA=16
I wonder if we should discourage the usage of DAPL with Intel MPI ... any comment?
If --disable-f77 is given to configure or if the test for a fortran compiler fails, and if comex detects a BLAS library, then the BLAS dependency isn't passed on to the GA linker. You get many linker errors such as
/home/username/ga-git/bld_nofort/comex/../../comex/src-common/acc.h:110: undefined reference to `daxpy_'
The make flags target doesn't show a dependency on BLAS either.
# ===========================================================================
F77="mpif90"
CC="mpicc"
# Suggested compiler/linker options are as follows.
# GA libraries are installed in /home/username/ga-git/bld_nofort/lib
# GA headers are installed in /home/username/ga-git/bld_nofort/include
#
CPPFLAGS="-I/home/username/ga-git/bld_nofort/include"
#
LDFLAGS="-L/home/username/ga-git/bld_nofort/lib"
#
# For Fortran Programs:
FFLAGS=""
LIBS="-lga"
#
# For C Programs:
CFLAGS=""
LIBS="-lga"
# ===========================================================================
Test xGA on Intel KNL and Omni-path system available on TACC.
The premise of calling GA_Fence_init() and later ending the fence with GA_Fence() is by design not thread safe since state is stored globally between these functions.
Do we consider an API change where we return a handle?
Here is the alphabetical candidate list of unused preprocessor symbols. Remove dead symbols and their associated code blocks where possible.
We should assume at minimum an MPI runtime is linked in and in use.
TCGMSG4/5 are obsolete. They should be removed from the code base including any preprocessor symbols. TCGMSG4 uses 'pfiles' for running in parallel. This complicates the Makefile test suite and should also be removed.
TCGMSG-MPI is the MPI compatibility layer and is always compiled. This has been the transition path for many years now. For example, NWChem can continue to use PBEGIN() as part of TCGMSG-MPI as needed.
There are a few lingering references to PVM that should be removed.
These changes would also let us remove dead code where the preprocessor symbol MSG_COMMS_MPI is used throughout the code.
If I build GA with --with-ofa and then try to link against it I get:
lib/libarmci.a(armci.o): In function `PARMCI_NbAccS':
armci.c:(.text+0x7b4): undefined reference to `comex_nbacc'
It seems the symbol just doesn't get put into the library:
> nm ga-mvapich2/lib/libarmci.a | grep -i comex_nbacc
U comex_nbacc
U comex_nbaccs
U comex_nbaccv
00000000000004c0 T comex_nbaccs
0000000000000270 T comex_nbaccv
> nm ga-mvapich2/lib/libcomex.a | grep -i comex_nbacc
00000000000004c0 T comex_nbaccs
0000000000000270 T comex_nbaccv
Everything works fine if I instead use --with-openib. I just wanted to see whether there was any difference in performance between the two. Is --with-openib still the recommended option?
pnga_update_ghosts() is not thread safe due to the use of a non-blocking handle and the non-blocking handle is not thread safe.
/* work arrays used in all routines */
static Integer dims[MAXDIM], ld[MAXDIM-1];
static Integer lo[MAXDIM],hi[MAXDIM];
static Integer one_arr[MAXDIM]={1,1,1,1,1,1,1};
Move these work arrays into the functions.
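A sketch of what that could look like for one routine. MAXDIM and Integer are redefined here as stand-ins so the snippet compiles on its own, and patch_volume() is a hypothetical function, not one from the GA sources.

```c
#define MAXDIM 7          /* assumption: matches the 7 initializers of one_arr */
typedef long Integer;     /* stand-in for GA's Integer type */

/* Number of elements in the patch [lo, hi], using only stack-local scratch. */
Integer patch_volume(int ndim, const Integer *lo_in, const Integer *hi_in)
{
    Integer dims[MAXDIM]; /* automatic storage: private to each call */
    Integer volume = 1;

    if (ndim < 0 || ndim > MAXDIM) {
        return -1;        /* reject dimensions the scratch array cannot hold */
    }
    for (int i = 0; i < ndim; ++i) {
        dims[i] = hi_in[i] - lo_in[i] + 1;
        volume *= dims[i];
    }
    return volume;
}
```

Since the arrays hold at most MAXDIM entries, automatic storage is cheap and makes each call independent of every other thread.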
On a side note, snga_copy_old() might be dead code.
Bug found in snga_local_transpose() where it uses an int instead of an Integer type.