mpip / pfft
Parallel fast Fourier transforms
License: GNU General Public License v3.0
Is it possible to do 2D transforms on a 2D procmesh?
I tried this and ran into a divide-by-zero error; the stack trace is below. There are test cases in the source code doing 3D transforms on a 3D procmesh, so I imagine it may not be very difficult to extend the library to support 2D on 2D?
#0 0x00007fffe2bc744a in pfft_num_blocks (global_block_size=0,
global_array_size=0) at ../../pfft-1.0.8-alpha2-fftw3/kernel/block.c:84
#1 pfft_local_block_offset (which_block=0, global_block_size=0,
global_array_size=0) at ../../pfft-1.0.8-alpha2-fftw3/kernel/block.c:58
#2 pfft_local_block_size_and_offset (global_array_size=0,
global_block_size=0, which_block=0,
local_block_size=local_block_size@entry=0x108f230,
local_block_start=local_block_start@entry=0x108f250)
at ../../pfft-1.0.8-alpha2-fftw3/kernel/block.c:37
#3 0x00007fffe2bbdc98 in pfft_decompose_1d (local_n_start=0x108f250,
local_n=0x108f230, which_block=<optimized out>,
block_size=<optimized out>, pn=<optimized out>)
at ../../pfft-1.0.8-alpha2-fftw3/util/util.c:58
#4 pfft_decompose (pn=<optimized out>, block=<optimized out>,
rnk_pm=<optimized out>, coords_pm=<optimized out>,
local_n=<optimized out>, local_start=<optimized out>)
at ../../pfft-1.0.8-alpha2-fftw3/util/util.c:50
#5 0x00007fffe2bd4f33 in decompose_nontransposed (
local_start=<optimized out>, local_n=<optimized out>,
trafo_flag=<optimized out>, coords_pm=<optimized out>,
rnk_pm=<optimized out>, blk=<optimized out>, n=<optimized out>,
rnk_n=<optimized out>)
at ../../pfft-1.0.8-alpha2-fftw3/kernel/partrafo-transposed.c:381
#6 local_size_transposed (rnk_n=2, ni=<optimized out>, no=0x10db6f0, iblock=0x10db7f0, oblock=0x10db810, rnk_pm=2, coords_pm=0x10db850, trafo_flag=8194,
transp_flag=2, local_ni=0x108f210, local_i_start=0x108f230, local_no=0x108f220, local_o_start=0x108f240)
at ../../pfft-1.0.8-alpha2-fftw3/kernel/partrafo-transposed.c:358
#7 0x00007fffe2bd54cf in pfft_local_size_partrafo_transposed (rnk_n=rnk_n@entry=2, n=n@entry=0x10db630, ni=ni@entry=0x109ccc0, no=no@entry=0x10db6f0,
howmany=howmany@entry=1, iblock=iblock@entry=0x10db7f0, oblock=0x10db810, rnk_pm=2, comms_pm=0x109cc60, transp_flag=2, trafo_flags=0x10db770,
local_ni=0x108f210, local_i_start=0x108f230, local_no=0x108f220, local_o_start=0x108f240) at ../../pfft-1.0.8-alpha2-fftw3/kernel/partrafo-transposed.c:108
#8 0x00007fffe2bc9906 in pfft_local_size_partrafo (rnk_n=2, n=0x10af590, ni=0x10af590, no=0x10af590, howmany=howmany@entry=1,
iblock_user=iblock_user@entry=0x0, oblock_user=0x0, comm=-2080374780, trafo_flag_user=8194, pfft_flags=2050, local_ni=0x108f210, local_i_start=0x108f230,
local_no=<optimized out>, local_o_start=0x108f240) at ../../pfft-1.0.8-alpha2-fftw3/kernel/partrafo.c:261
#9 0x00007fffe2bd1551 in pfft_local_size_many_dft_r2c (rnk_n=<optimized out>, n=<optimized out>, ni=<optimized out>, no=<optimized out>,
howmany=howmany@entry=1, iblock=iblock@entry=0x0, oblock=0x0, comm_cart=-2080374780, pfft_flags=2050, local_ni=0x108f210, local_i_start=0x108f230,
local_no=0x108f220, local_o_start=0x108f240) at ../../pfft-1.0.8-alpha2-fftw3/api/api-adv.c:54
#10 0x00007fffe2bc3008 in pfft_local_size_dft_r2c (rnk_n=<optimized out>, n=<optimized out>, comm_cart=<optimized out>, pfft_flags=<optimized out>,
local_ni=<optimized out>, local_i_start=<optimized out>, local_no=0x108f220, local_o_start=0x108f240) at ../../pfft-1.0.8-alpha2-fftw3/api/api-basic.c:910
The documentation seems to suggest that in-place r2c/c2r is possible without PFFT_PADDED_R2C.
I don't think it is possible, or is it?
Hi,
I have been testing the scaling of PFFT on our cluster (a few thousand Broadwell nodes with 28 cores each). Although the scaling for my problem (a 3D grid of 128^3, quite a small grid indeed!) is satisfying, its overall performance is poor compared to a code such as fftwpp (from Bowman's group), which uses a 2D data decomposition. Up to 64 CPUs, I find that PFFT is consistently 10 times slower than fftwpp. I am not surprised that a 3D data decomposition loses some performance to a 2D one because of the extra communication (as explained in the original paper), but the size of the loss baffles me, and I frankly think I might be doing something wrong in compiling PFFT or in linking it to the system fftw3-mpi. Could you give me some clue on this?
Thank you in advance
Max.
I have been trying to compile PFFT with the MKL version of FFTW, which is supposed to be better optimised than an FFTW3 compiled by me. The compilation fails with
"""
checking for fftw_mpi_init in -lfftw3_mpi... no
configure: error: You do not seem to have the MPI part of the FFTW-3.3 library installed.
"""
I tried to tweak the configure script to change the library name (configure line 18201)
fftw3_mpi_LIBS="-lmkl_core -lmkl_intel_lp64 -lmkl_sequential -lpthread -lm"
but this didn't work.
Is there something I should be doing that I'm not?
It is due to line 953 in api-basic.c:
complex_conjugate(conj_in, conj_out, ths->rnk_n, ths->local_ni);
This line overwrites the input with the output.
This wasn't caught by the test case due to pull request #3
The Fortran interface to pfft_plan_with_nthreads is missing. I use this:
interface
subroutine pfft_plan_with_nthreads(nthreads) bind(C, name="pfft_plan_with_nthreads")
import
integer(C_INT), value :: nthreads
end subroutine
end interface
Hi, you have a broken symlink:
$ find -xtype l
./doc/code/manual_min_c2c.c
Looks like it has a whole history:
$ git log -- $(find -xtype l)
commit 5b625ef4454eaafe4df38b8ead339de5e6b4d6f5
Author: Michael Pippig <[email protected]>
Date: Wed Jul 17 11:00:36 2013 +0200
add pfft manual to build system
$ git log --stat -- $(readlink -f $(find -xtype l))
commit 3c1246470290add168dc9b4965d282471ef94a12
Author: Michael Pippig <[email protected]>
Date: Thu Jun 12 16:04:12 2014 +0200
create 1d procmesh for non-Cartesian communicator per default
tests/manual_min_c2c.c | 52 ----------------------------------------------------
1 file changed, 52 deletions(-)
commit a938fdc7c86f96cc8de2c85ec8ab6e44bff99d73
Author: Michael Pippig <[email protected]>
Date: Wed Jun 11 16:28:54 2014 +0200
manual: introduction, tutorial
tests/manual_min_c2c.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
commit f9b86da68433c8cba545c7ee6f060a6fef807d2c
Author: Benedikt Morbach <[email protected]>
Date: Tue Jan 21 10:56:37 2014 +0100
pfft: add c2r / c2c comparison tests
tests/manual_min_c2c.c | 52 ----------------------------------------------------
1 file changed, 52 deletions(-)
commit 6f45f1c3972fa25e1b4cd2950246e34d9f795ac9
Author: Michael Pippig <[email protected]>
Date: Fri Jan 17 15:45:46 2014 +0100
remove padding of r2c inputs from interface and testcases
tests/manual_min_c2c.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
commit ad68b4e4d6126952aa21c1d2ced4ed94223aa6c1
Author: Michael Pippig <[email protected]>
Date: Thu Jul 18 02:40:57 2013 +0200
PFFT manual: save intermediate state
tests/manual_min_c2c.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
commit 5b625ef4454eaafe4df38b8ead339de5e6b4d6f5
Author: Michael Pippig <[email protected]>
Date: Wed Jul 17 11:00:36 2013 +0200
add pfft manual to build system
tests/manual_min_c2c.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
The next generation of Intel processors (Knights Landing) will have something like 70+ cores; mass deployment to major computing facilities will be next year.
It would be a good test case if PFFT can both scale out across nodes and scale within a node.
For example, the library built from easybuild produces:
$ readelf -d $EBROOTPFFT/lib/libpfft.so
Dynamic section at offset 0x2ed38 contains 30 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libmpi.so.40]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000e (SONAME) Library soname: [libpfft.so.0]
0x000000000000001d (RUNPATH) Library runpath: [/opt/EasyBuild/2022a/software/OpenMPI/4.1.4-GCC-11.3.0/lib:/opt/EasyBuild/2022a/software/hwloc/2.7.1-GCCcore-11.3.0/lib:/opt/EasyBuild/2022a/software/libevent/2.1.12-GCCcore-11.3.0/lib]
0x000000000000000c (INIT) 0x8000
0x000000000000000d (FINI) 0x26abc
0x0000000000000019 (INIT_ARRAY) 0x2fd28
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x2fd30
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x0000000000000004 (HASH) 0x200
0x000000006ffffef5 (GNU_HASH) 0xcb0
0x0000000000000005 (STRTAB) 0x3e40
0x0000000000000006 (SYMTAB) 0x1710
0x000000000000000a (STRSZ) 9421 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x30000
0x0000000000000002 (PLTRELSZ) 5832 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x68f8
0x0000000000000007 (RELA) 0x66b8
0x0000000000000008 (RELASZ) 576 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffe (VERNEED) 0x6658
0x000000006fffffff (VERNEEDNUM) 2
0x000000006ffffff0 (VERSYM) 0x630e
0x000000006ffffff9 (RELACOUNT) 3
0x0000000000000000 (NULL) 0x0
But pfft itself calls functions specific to fftw3-mpi.so:
/opt/EasyBuild/2022a/software/binutils/2.38-GCCcore-11.3.0/bin/ld: /opt_buildbot/linux-debian11/sandybridge/EasyBuild/2022a/software/PFFT/1.0.8-alpha-foss-2022a/lib64/libpfft.so: undefined reference to `fftw_mpi_init'
/opt/EasyBuild/2022a/software/binutils/2.38-GCCcore-11.3.0/bin/ld: /opt_buildbot/linux-debian11/sandybridge/EasyBuild/2022a/software/PFFT/1.0.8-alpha-foss-2022a/lib64/libpfft.so: undefined reference to `fftw_mpi_execute_r2r'
/opt/EasyBuild/2022a/software/binutils/2.38-GCCcore-11.3.0/bin/ld: /opt_buildbot/linux-debian11/sandybridge/EasyBuild/2022a/software/PFFT/1.0.8-alpha-foss-2022a/lib64/libpfft.so: undefined reference to `fftw_mpi_plan_many_transpose'
/opt/EasyBuild/2022a/software/binutils/2.38-GCCcore-11.3.0/bin/ld: /opt_buildbot/linux-debian11/sandybridge/EasyBuild/2022a/software/PFFT/1.0.8-alpha-foss-2022a/lib64/libpfft.so: undefined reference to `fftw_mpi_local_size_many_transposed'
/opt/EasyBuild/2022a/software/binutils/2.38-GCCcore-11.3.0/bin/ld: /opt_buildbot/linux-debian11/sandybridge/EasyBuild/2022a/software/PFFT/1.0.8-alpha-foss-2022a/lib64/libpfft.so: undefined reference to `fftw_mpi_cleanup'
On FreeBSD I am getting:
configure:20297: checking for fftw_mpi_init in -lfftw3_mpi
configure:20330: cc -o conftest -O2 -pipe -fno-omit-frame-pointer -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing -fno-omit-frame-pointer -isystem /usr/local/include -fstack-protector-strong -L/usr/local/lib conftest.c -lfftw3_mpi -lfftw3 -lm -lmpi >&5
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_op_sum
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_comm_null
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_unsigned
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to MPI_Comm_f2c
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_op_land
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_unsigned_long
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_op_lor
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_char
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_int
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_double
ld: error: /usr/local/lib/libfftw3_mpi.so: undefined reference to ompi_mpi_op_max
cc: error: linker command failed with exit code 1 (use -v to see invocation)
The Ghost Cell part of the API never explicitly confirms that the ghost cell data is appended to the end of the allocated storage space.
The local_start of an empty rank is always set to 0. This causes unnecessary branching in downstream code. The logical model is simpler if we think of these slabs as having a size of zero but an offset continuing the same pattern as the others.
For example, the local_i_start of a 3D r2c transform on a 2x53 domain decomposition (this set-up is sub-optimal) is currently:
([ 0, 512]),
([ 0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200,
220, 240, 260, 280, 300, 320, 340, 360, 380, 400, 420,
440, 460, 480, 500, 520, 540, 560, 580, 600, 620, 640,
660, 680, 700, 720, 740, 760, 780, 800, 820, 840, 860,
880, 900, 920, 940, 960, 980, 1000, 1020, 0]
I would suggest changing the last 0 to 1020.
The pfft.pc file is created but not installed. It should probably go to $PREFIX/lib/pkgconfig, so that pkg-config finds it out of the box.
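Assuming the standard automake conventions for installing pkg-config files (untested against the PFFT build system), the fix could look like this in the top-level Makefile.am:

```makefile
# Hypothetical Makefile.am addition: install the generated pfft.pc
# into the directory pkg-config searches by default.
pkgconfigdir = $(libdir)/pkgconfig
pkgconfig_DATA = pfft.pc
```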
Hello,
I am trying to compile PFFT for Android. I was expecting to build it for the multiple architectures that I have to support (armeabi-v7a and arm64-v8a) by adding parameters to the configure command like this:
./configure CXX="/Users/XXX/Library/Android/sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/darwin-x86_64/bin/clang++ -target aarch64-linux-android" --prefix=/Users/XXX/Downloads/pfft-master/arm64-v8a
This command replaces gcc with clang, since gcc cannot compile for the two architectures that I need. The configure step works, but when I run make it complains that symbols are not found for architecture x86_64.
I looked into the generated Makefile and other files, and it seems that gcc is used anyway, even if I pass clang in the CC parameter of the configure command.
Any idea on how I could generate the library for the two mentioned architectures?
My LaTeX compiler (Fedora 19) complains about:
1: dsfont: Fedora doesn't package it either.
2: subfigure is deprecated; new documents should use subfig.
With a 16x16x16 mesh divided over 4 processes (1x4, 4x1, 2x2),
the roundtrip error (r->c, c->r) on a Gaussian initial field can be as big as 0.001.
The error is large even on 1 process (comparing with numpy.fft): typically around 2e-5 in the forward transform, accumulating to ~0.002 after the backward.
I wonder how this compares with FFTWF, and whether there is anything we can do about it.
The guru FFTW interface allows arbitrarily strided input and output arrays. PFFT does not.
This is a useful feature in a particle-mesh code where the local mesh contains a 'ghost region' that is shared with other processes but does not participate in the FFT.
If I use the patched version of FFTW3 directly, I get this error with an r2c in-place transformation:
PMPI_Alltoall(925): Buffers must not be aliased. Consider using MPI_IN_PLACE or setting MPICH_NO_BUFFER_ALIAS_CHECK
Rank 118 [Mon Sep 14 16:51:53 2015] [c2-3c2s10n2] Fatal error in PMPI_Alltoall: Invalid buffer pointer, error stack:
PMPI_Alltoall(966): MPI_Alltoall(sbuf=0x2aab0cf9a040, scount=524800, MPI_FLOAT, rbuf=0x2aab0cf9a040, rcount=524800, MPI_FLOAT, comm=0xc4000002) failed
I wonder why we are doing an Alltoall from the same address at all. Is it safe to just skip it if I == O?
The attached program for a 2D c2c transform with a 1D parallel decomposition crashes. Encountered on several computers and slightly different versions of PFFT; the last test was done with 1.0.6-alpha. In this form it crashes with PFFT_PRESERVE_INPUT. When I use an in-place transform, it also crashes with PFFT_DESTROY_INPUT.
> mpicc -Wall -std=c99 pfft_crash.c -lpfft -lfftw3 -lfftw3_mpi -g
> mpirun -n 12 valgrind ./a.out
==13960== Invalid read of size 8
==13960== at 0x5092B69: fftw_plan_destroy_internal (in /usr/lib64/libfftw3.so.3.4.4)
==13960== by 0x5472453: ??? (in /usr/lib64/libfftw3_mpi.so.3.4.4)
==13960== by 0x5094172: ??? (in /usr/lib64/libfftw3.so.3.4.4)
==13960== by 0x5162969: ??? (in /usr/lib64/libfftw3.so.3.4.4)
==13960== by 0x5162B63: fftw_mkapiplan (in /usr/lib64/libfftw3.so.3.4.4)
==13960== by 0x546ED5B: fftw_mpi_plan_many_transpose (in /usr/lib64/libfftw3_mpi.so.3.4.4)
==13960== by 0x4E4DF2F: pfft_plan_global_transp (in /usr/local/lib64/libpfft.so.0.0.0)
==13960== by 0x4E41B03: pfft_plan_partrafo_transposed (in /usr/local/lib64/libpfft.so.0.0.0)
==13960== by 0x4E47536: pfft_plan_partrafo (in /usr/local/lib64/libpfft.so.0.0.0)
==13960== by 0x4E5472F: pfft_plan_many_dft (in /usr/local/lib64/libpfft.so.0.0.0)
==13960== by 0x4E540AB: pfft_plan_dft (in /usr/local/lib64/libpfft.so.0.0.0)
==13960== by 0x400E66: main (pfft_crash.c:44)
==13960== Address 0xb98c240 is 0 bytes inside a block of size 168 free'd
==13960== at 0x4C2A37C: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==13960== by 0x5471EFF: fftw_mpi_mkplans_posttranspose (in /usr/lib64/libfftw3_mpi.so.3.4.4)
==13960== by 0x54720C1: ??? (in /usr/lib64/libfftw3_mpi.so.3.4.4)
==13960== by 0x5094172: ??? (in /usr/lib64/libfftw3.so.3.4.4)
==13960== by 0x5162969: ??? (in /usr/lib64/libfftw3.so.3.4.4)
==13960== by 0x5162B63: fftw_mkapiplan (in /usr/lib64/libfftw3.so.3.4.4)
program:
#include <complex.h>
#include <pfft.h>
int main(int argc, char **argv)
{
int np[1];
ptrdiff_t n[2];
ptrdiff_t alloc_local;
ptrdiff_t local_ni[2], local_i_start[2];
ptrdiff_t local_no[2], local_o_start[2];
double err;
pfft_complex *in, *out;
pfft_plan plan_forw=NULL, plan_back=NULL;
MPI_Comm comm_cart_1d;
/* Set size of FFT and process mesh */
n[0] = 200; n[1] = 200;
np[0] = 12;
/* Initialize MPI and PFFT */
MPI_Init(&argc, &argv);
pfft_init();
/* Create one-dimensional process grid of size np[0], if possible */
if( pfft_create_procmesh(1, MPI_COMM_WORLD, np, &comm_cart_1d) ){
pfft_fprintf(MPI_COMM_WORLD, stderr, "Error: This test file only works with %d processes.\n", np[0]);
MPI_Finalize();
return 1;
}
/* Get parameters of data distribution */
alloc_local = pfft_local_size_dft(2, n, comm_cart_1d, PFFT_TRANSPOSED_NONE,
local_ni, local_i_start, local_no, local_o_start);
/* Allocate memory */
in = pfft_alloc_complex(alloc_local);
out = pfft_alloc_complex(alloc_local);
/* Plan parallel forward FFT */
plan_forw = pfft_plan_dft(
2, n, in, out, comm_cart_1d, PFFT_FORWARD, PFFT_TRANSPOSED_NONE| PFFT_ESTIMATE| PFFT_PRESERVE_INPUT);
/* Plan parallel backward FFT */
plan_back = pfft_plan_dft(
2, n, out, in, comm_cart_1d, PFFT_BACKWARD, PFFT_TRANSPOSED_NONE| PFFT_ESTIMATE| PFFT_PRESERVE_INPUT);
/* Initialize input with random numbers */
pfft_init_input_complex(2, n, local_ni, local_i_start,
in);
/* execute parallel forward FFT */
pfft_execute(plan_forw);
/* clear the old input */
pfft_clear_input_complex(2, n, local_ni, local_i_start,
in);
/* execute parallel backward FFT */
pfft_execute(plan_back);
/* Scale data */
for(ptrdiff_t l=0; l < local_ni[0] * local_ni[1]; l++)
in[l] /= (n[0]*n[1]);
/* Print error of back transformed data */
err = pfft_check_output_complex(2, n, local_ni, local_i_start, in, comm_cart_1d);
pfft_printf(comm_cart_1d, "Error after one forward and backward trafo of size n=(%td, %td):\n", n[0], n[1]);
pfft_printf(comm_cart_1d, "maxerror = %6.2e;\n", err);
/* free mem and finalize */
pfft_destroy_plan(plan_forw);
pfft_destroy_plan(plan_back);
MPI_Comm_free(&comm_cart_1d);
pfft_free(in); pfft_free(out);
MPI_Finalize();
return 0;
}
There is already a GPU-enabled parallel FFT library: https://github.com/amirgholami/accfft/ .
AccFFT is written in C++. I wonder if we can port the GPU-related code to C and use it in PFFT as a GPU backend.
Here is the log:
[avmo@kthxps pfft-1.0.8-alpha]$
[avmo@kthxps pfft-1.0.8-alpha]$ ls
api/ doc/ kernel/ tests/ aclocal.m4 bootstrap.sh* config.h.in configure.ac COPYING INSTALL Makefile.in pfft.pc.in TODO
build-aux/ gcell/ m4/ util/ AUTHORS ChangeLog configure* CONVENTIONS fconfig.h.in Makefile.am NEWS README
[avmo@kthxps pfft-1.0.8-alpha]$ export LANG=C
[avmo@kthxps pfft-1.0.8-alpha]$ ./bootstrap.sh
PLEASE IGNORE WARNINGS AND ERRORS
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'build-aux'.
libtoolize: linking file 'build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: linking file 'm4/libtool.m4'
libtoolize: linking file 'm4/ltoptions.m4'
libtoolize: linking file 'm4/ltversion.m4'
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
autoreconf: running: libtoolize --copy --force
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'build-aux'.
libtoolize: copying file 'build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
autoreconf: running: /usr/bin/autoconf --force
autoreconf: running: /usr/bin/autoheader --force
autoreconf: running: automake --add-missing --copy --force-missing
configure.ac:139: installing 'build-aux/compile'
configure.ac:55: installing 'build-aux/missing'
api/Makefile.am: installing 'build-aux/depcomp'
autoreconf: Leaving directory `.'
[avmo@kthxps pfft-1.0.8-alpha]$ ./configure --prefix=/home/avmo/src/spack/opt/spack/linux-archrolling-x86_64/gcc-7.3.0/pfft-1.0.8-alpha-vg4mvddn4ybvvasdceoodnlxh3xfxv4d CC=/usr/bin/gcc MPICC=/usr/bin/mpicc FC=/usr/bin/gfortran MPIFC=/usr/bin/mpif90
configure: ****************************************************************
configure: * Configuring in common/pfft *
configure: ****************************************************************
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for style of include used by make... GNU
checking for gcc... /usr/bin/gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/bin/gcc accepts -g... yes
checking for /usr/bin/gcc option to accept ISO C89... none needed
checking whether /usr/bin/gcc understands -c and -o together... yes
checking dependency style of /usr/bin/gcc... gcc3
checking for function MPI_Init... no
checking for function MPI_Init in -lmpi... no
checking for function MPI_Init in -lmpich... no
configure: error: in `/home/avmo/tmp/pfft-1.0.8-alpha':
configure: error: PFFT requires an MPI C compiler.
See `config.log' for more details
[avmo@kthxps pfft-1.0.8-alpha]$ pacman -Qi openmpi
Name : openmpi
Version : 3.0.0-1
Description : High performance message passing library (MPI)
Architecture : x86_64
URL : https://www.open-mpi.org
Licenses : custom:OpenMPI
Groups : None
Provides : None
Depends On : libltdl hwloc openssh
Optional Deps : gcc-fortran: fortran support [installed]
Required By : arpack hdf5-openmpi icet ospray python-mpi4py python2-mpi4py
Optional For : boost-libs valgrind vtk vtk-visit vtk6
Conflicts With : None
Replaces : None
Installed Size : 9.52 MiB
Packager : Levente Polyak <[email protected]>
Build Date : Wed 20 Dec 2017 10:26:45 AM CET
Install Date : Mon 08 Jan 2018 02:31:37 PM CET
Install Reason : Installed as a dependency for another package
Install Script : No
Validated By : Signature
I have problems compiling the library using the Cray compiler. It is first detected as a gcc compiler, and then configure has trouble finding the correct flags and options.
I was able to fix the unknown flag for extending the line width in Fortran by supplying -N 255 in FCFLAGS.
However, configure then tries to link a C program with a Fortran object using -lgfortran, which fails.
The actual way how to link them with Cray is just
ftn sub.f90 -c -o sub.o
cc sub.o main.c
The configure fails in this step:
configure: error: linking to Fortran libraries from C fails
See `config.log' for more details.
and config.log contains several variations of:
configure:6540: checking for dummy main to link with Fortran libraries
configure:6574: cc -o conftest -g conftest.c -L/opt/cray/cce/8.3.7/CC/x86-64/lib/x86-64 -L/opt/gcc/4.8.1/snos/lib64 /opt/cray/cce/8.3.7/craylibs/x86-64/libmodules.a /opt/cray/cce/8.3.7/craylibs/x86-64/libomp.a
/opt/cray/cce/8.3.7/craylibs/x86-64/libopenacc.a -L/opt/cray/fftw/3.3.4.1/sandybridge/lib -L/opt/cray/dmapp/default/lib64 -L/opt/cray/mpt/7.1.1/gni/mpich2-cray/83/lib -L/opt/cray/libsci/13.0.1/CRAY/83/sandybridge
/lib -L/opt/cray/rca/1.0.0-2.0501.48090.7.46.ari/lib64 -L/opt/cray/alps/5.1.1-2.0501.8507.1.1.ari/lib64 -L/opt/cray/xpmem/0.1-2.0501.48424.3.3.ari/lib64 -L/opt/cray/dmapp/7.0.1-1.0501.8315.8.4.ari/lib64 -L/opt/cra
y/pmi/5.0.6-1.0000.10439.140.2.ari/lib64 -L/opt/cray/ugni/5.0-1.0501.8253.10.22.ari/lib64 -L/opt/cray/udreg/2.3.2-1.0501.7914.1.13.ari/lib64 -L/opt/cray/atp/1.7.5/lib -L/opt/cray/cce/8.3.7/craylibs/x86-64 -L/opt/c
ray/wlm_detect/1.0-1.0501.47908.2.2.ari/lib64 -lfftw3f_mpi -lfftw3f_threads -lfftw3f -lfftw3_mpi -lfftw3_threads -lfftw3 -lAtpSigHandler -lAtpSigHCommData -lsci_cray_mpi_mp -lsci_cray_mp -lmpichf90_cray -lmpich_cr
ay -lpgas-dmapp -lcray-c++-rts -lcraystdc++ -lxpmem -ldmapp -lpmi -ludreg -lalpslli -lalpsutil -lrca -lwlm_detect -lugni -lomp -lcraymp -lmodules -lfi -lf -lpthread -lcraymath -lm -lgfortran -lquadmath -lu -lrt -l
csup -ltcmalloc_minimal -lstdc++ -L/opt/gcc/4.8.1/snos/lib/gcc/x86_64-suse-linux/4.8.1 -L/opt/cray/cce/8.3.7/cray-binutils/x86_64-unknown-linux-gnu/lib -L//usr/lib64 >&5
CC-1254 craycc: WARNING
The environment variable "CPATH" is not supported.
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):(.bss+0x13): multiple definition of `FLAG__namespace_do_not_use_directly_use_DECLARE_bool_instead::FLAGS_tcmalloc_abort_on_large_alloc'
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):(.bss+0x13): first defined here
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o): In function `TCMallocImplementation::GetAllocatedSize(void*)':
tcmalloc.cc:(.text+0x1f0): multiple definition of `TCMallocImplementation::GetAllocatedSize(void*)'
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):tcmalloc.cc:(.text+0x1f0): first defined here
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o): In function `TCMallocImplementation::MarkThreadBusy()':
tcmalloc.cc:(.text+0xe40): multiple definition of `TCMallocImplementation::MarkThreadBusy()'
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):tcmalloc.cc:(.text+0xe40): first defined here
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):(.bss+0x11): multiple definition of `FLAG__namespace_do_not_use_directly_use_DECLARE_bool_instead::FLAGS_tcmalloc_pad_cacheline'
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):(.bss+0x11): first defined here
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o): In function `tc_version':
tcmalloc.cc:(.text+0x4e40): multiple definition of `tc_version'
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):tcmalloc.cc:(.text+0x4e40): first defined here
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o): In function `tc_set_new_mode':
tcmalloc.cc:(.text+0x4e70): multiple definition of `tc_set_new_mode'
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o):tcmalloc.cc:(.text+0x4e70): first defined here
/opt/cray/cce/8.3.7/craylibs/x86-64/libtcmalloc_minimal.a(tcmalloc.o): In function `tc_malloc':
How can I fix configure to accept the Cray compilers?
I was trying to install this library on a supercomputing cluster. However, I can only install it locally in my home directory, since I do not have authorisation to install it elsewhere.
After loading the modules intel/2020.4, intelmpi/2020.4, openucx/1.13.1 and fftw/3.3.10, I cloned the git repository, followed the installation instructions, and ran ./bootstrap.sh followed by ./configure and make install.
The Makefile does its thing for some time, then it fails with the following error:
make[2]: Nothing to be done for 'install-exec-am'.
/usr/bin/mkdir -p '/usr/local/include'
/usr/bin/install -c -m 644 pfft.h '/usr/local/include'
/usr/bin/install: cannot create regular file '/usr/local/include/pfft.h': Permission denied
make[2]: *** [Makefile:506: install-includeHEADERS] Error 1
Please advise what I must do to install it without needing admin privileges. Thank you.
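The 'Permission denied' comes from the default installation prefix /usr/local. For a standard autoconf package like this one, pointing --prefix at a directory you own at configure time should be enough (the path below is illustrative):

```shell
# Reconfigure with a prefix inside $HOME, then rebuild and install there.
./configure --prefix=$HOME/.local
make
make install
# Then point the compiler and linker at it, e.g.
#   -I$HOME/.local/include and -L$HOME/.local/lib
```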
Running a 4x4x4 transform on a 2x1 decomposition, valgrind gives the following error.
==18470== Uninitialised byte(s) found during client check request
==18470== at 0x3CE045CFE1: ??? (in /usr/lib64/openmpi/lib/libopen-pal.so.6.2.1)
==18470== by 0x3CE1075F74: PMPI_Sendrecv (in /usr/lib64/openmpi/lib/libmpi.so.1.6.0)
==18470== by 0x4301FE: transpose_chunks (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x4303F3: apply (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x407222: execute_transposed.isra.1 (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x407D4C: pfft_execute_full (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x402A61: pm_c2r (pmpfft.c:159)
==18470== by 0x4032E5: main (pmpfft.c:279)
==18470== Address 0xa3a53e0 is 0 bytes inside a block of size 192 alloc'd
==18470== at 0x4A08D84: memalign (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==18470== by 0x436824: fftw_malloc_plain (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x4300CD: transpose_chunks (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x4303F3: apply (in /home/yfeng1/source/cola_halo/a.out)
==18470== by 0x407222: execute_transposed.isra.1 (in /home/yfeng1/source/cola_halo/a.out)
Is there a PFFT way to globally transpose a pencil-decomposed 3D array?
Is using a pfft_plan_many_*_skipped function and skipping all three transforms a viable option?
The current manual is written strictly in LaTeX, and it may be hard to compile it to HTML pages.
HTML documentation is easier to access than PDF documentation, because 1) it doesn't need a PDF reader, 2) it is easier to search, and 3) it is paginated by sections rather than by pages.
For HTML generation, I would recommend looking into reStructuredText and Sphinx, which can produce both HTML and PDF documents.
If this is desired, I can start a PR porting the TeX sources to .rst.
Do you run valgrind on pfft to check for leaks?
It looks like C99 syntax (for-loop declarations) doesn't work out of the box on some compilers. I ran into this after messing around with Travis for a while.
An in-place transform fails on edison.nersc.gov with Intel 14.0.2 20140120.
I'll look deeper into this. Could it be a problem with FFTW, since the domain decomposition is 1D in this simple test?
#include <complex.h>
#include <string.h> /* for memcpy */
#include <pfft.h>
int main(int argc, char **argv)
{
int np[2];
ptrdiff_t n[3];
ptrdiff_t alloc_local;
ptrdiff_t local_ni[3], local_i_start[3];
ptrdiff_t local_no[3], local_o_start[3];
double err;
pfft_complex *in, *out;
pfft_plan plan_forw=NULL, plan_back=NULL;
MPI_Comm comm_cart_2d;
double data[] = {
-0.51503939, 0.59189672, 0.0478734, -0.48840469, -0.35495284, -0.39181335,
1.86426106, -1.37148975, 2.22627536, -0.11810965, 0.11984837, 0.18259889,
-0.65773926, -1.64623164, -1.14158407, -1.43908939,
} ;
/* Set size of FFT and process mesh */
// n[0] = 29; n[1] = 27; n[2] = 31;
n[0] = 2; n[1] = 3; n[2] = 2;
np[0] = 2; np[1] = 1;
/* Initialize MPI and PFFT */
MPI_Init(&argc, &argv);
pfft_init();
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Create one-dimensional process grid of size np[0], if possible */
if( pfft_create_procmesh_1d(MPI_COMM_WORLD, np[0], &comm_cart_2d) ){
pfft_fprintf(MPI_COMM_WORLD, stderr, "Error: This test file only works with %d processes.\n", np[0]*np[1]);
MPI_Finalize();
return 1;
}
/* Get parameters of data distribution */
alloc_local = pfft_local_size_dft_3d(n, comm_cart_2d, PFFT_TRANSPOSED_NONE,
local_ni, local_i_start, local_no, local_o_start);
/* Allocate memory */
in = pfft_alloc_complex(alloc_local);
out = pfft_alloc_complex(alloc_local);
out = in;
/* Plan parallel forward FFT */
plan_forw = pfft_plan_dft_3d(
n, in, out, comm_cart_2d, PFFT_FORWARD, PFFT_TRANSPOSED_NONE| PFFT_MEASURE| PFFT_DESTROY_INPUT);
/* Plan parallel backward FFT */
plan_back = pfft_plan_dft_3d(
n, out, in, comm_cart_2d, PFFT_BACKWARD, PFFT_TRANSPOSED_NONE| PFFT_MEASURE| PFFT_DESTROY_INPUT);
/* Initialize input with random numbers */
pfft_init_input_complex_3d(n, local_ni, local_i_start,
in);
memcpy(in, data, sizeof(double) * alloc_local * 2);
/* execute parallel forward FFT */
pfft_execute(plan_forw);
ptrdiff_t l;
int r;
for (r = 0; r < 2; r++) {
MPI_Barrier(MPI_COMM_WORLD);
if (r != rank) continue;
printf ("out on rank %d :", rank);
for(l=0; l < alloc_local * 2; l ++) {
printf("%g\n", ((double*)out)[l]);
}
}
/* clear the old input */
// pfft_clear_input_complex_3d(n, local_ni, local_i_start,
// in);
/* execute parallel backward FFT */
pfft_execute(plan_back);
/* Scale data */
for (r = 0; r < 2; r++) {
MPI_Barrier(MPI_COMM_WORLD);
if (r != rank) continue;
printf ("on rank %d :", rank);
for(l=0; l < alloc_local * 2; l ++) {
printf("%g %g\n", ((double*)in)[l], data[l]);
}
}
for(l=0; l < local_ni[0] * local_ni[1] * local_ni[2]; l++)
in[l] /= (n[0]*n[1]*n[2]);
/* Print error of back transformed data */
err = pfft_check_output_complex_3d(n, local_ni, local_i_start, in, comm_cart_2d);
pfft_printf(comm_cart_2d, "Error after one forward and backward trafo of size n=(%td, %td, %td):\n", n[0], n[1], n[2]);
pfft_printf(comm_cart_2d, "maxerror = %6.2e;\n", err);
/* free mem and finalize */
pfft_destroy_plan(plan_forw);
pfft_destroy_plan(plan_back);
MPI_Comm_free(&comm_cart_2d);
// pfft_free(in); pfft_free(out);
MPI_Finalize();
return 0;
}
Is it possible to plan 1D transforms with PFFT using a size-1 communicator?
I am seeing a divide-by-zero error in the pfft_local_size_dft functions if I pass in a procmesh constructed with rnk_n=1.