
Comments (32)

hpwxf commented on August 17, 2024

I succeeded in reproducing this bug in a local Windows VM.
I get:

The following tests FAILED:
          1 - Test arr_to_mat (SEGFAULT)
          2 - Test arr_to_row (SEGFAULT)
          4 - Test arr_to_cube (SEGFAULT)
         20 - pytest (Not Run)
Errors while running CTest

Now, I will study what happens.

from carma.

hpwxf commented on August 17, 2024

Note to self:
I can reproduce the same error using MinGW with the following CMake configuration:

CC=gcc CXX=g++ cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=ON -DPYBIND11_CPP_STANDARD=-std=c++14 -DCMAKE_CXX_FLAGS="-D_hypot=hypot -Wa,-mbig-obj -DARMA_DONT_USE_STD_MUTEX -DHAVE_SNPRINTF -L/c/Python37/libs" -DPYTHON_PREFIX_PATH=/c/Python37 ..
$ ctest -R arr_to
Test project C:/Users/User/carma/build-gcc
    Start 16: Test arr_to_mat
1/4 Test #16: Test arr_to_mat ..................***Exception: SegFault  0.23 sec
    Start 17: Test arr_to_row
2/4 Test #17: Test arr_to_row ..................***Exception: SegFault  0.20 sec
    Start 18: Test arr_to_col
3/4 Test #18: Test arr_to_col ..................Exit code 0xc0000374***Exception:   0.12 sec
    Start 19: Test arr_to_cube
4/4 Test #19: Test arr_to_cube .................Exit code 0xc0000374***Exception:   0.12 sec

NB: The exception code 0xc0000374 indicates heap corruption


RUrlus commented on August 17, 2024

Does the MinGW reproduction indicate that this is more related to the way the heap is managed on Windows than to a compiler-specific issue?


hpwxf commented on August 17, 2024

Does the MinGW reproduction indicate that this is more related to the way the heap is managed on Windows than to a compiler-specific issue?

Yes, I think so.


hpwxf commented on August 17, 2024

Some info about my analysis.

The bug clearly lies somewhere between "how the memory is allocated" and "how it is deallocated".

I tried many things.

  • I checked that the pointer arg of arma::memory::release is a heap pointer using GetCurrentThreadStackLimits (doc)
  • I discovered Dr. Memory (with the option -malloc_callstacks); in the MinGW env, I get:
(arr ptr = 0x4257d80)
~~Dr.M~~ Error #509: INVALID HEAP ARGUMENT to free 0x0000000004257d80
~~Dr.M~~ # 0 replace_free                                                   [d:\a\drmemory\drmemory\common\alloc_replace.c:2710]
~~Dr.M~~ # 1 arma::Mat<>::~Mat                                              [C:/Users/User/carma/third_party/armadillo-code/include/arma
...
~~Dr.M~~ Note: refers to -1 byte(s) before next malloc
~~Dr.M~~ Note: next higher malloc: 0x0000000004257d80-0x00000000042583c0
~~Dr.M~~ Note: refers to -1600 byte(s) beyond last valid byte in prior malloc
~~Dr.M~~ Note: prev lower malloc:  0x0000000004257d80-0x00000000042583c0
...
~~Dr.M~~ Error #693: LEAK 1600 direct bytes 0x0000000004257d80-0x00000000042583c0 + 0 indirect bytes
~~Dr.M~~ # 0 replace_malloc                                     [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
~~Dr.M~~ # 1 _multiarray_umath.cp37-win_amd64!?                +0x0      (0x00007ffad4822213 <_multiarray_umath.cp37-win_amd64+0x2213>)
~~Dr.M~~ # 2 _multiarray_umath.cp37-win_amd64!?                +0x0      (0x00007ffad4867917 <_multiarray_umath.cp37-win_amd64+0x47917>)
~~Dr.M~~ # 3 _multiarray_umath.cp37-win_amd64!?                +0x0      (0x00007ffad486499d <_multiarray_umath.cp37-win_amd64+0x4499d>)
~~Dr.M~~ # 4 _multiarray_umath.cp37-win_amd64!?                +0x0      (0x00007ffad485bf8c <_multiarray_umath.cp37-win_amd64+0x3bf8c>)
~~Dr.M~~ # 5 c_steal_copy_array                                 [C:/Users/User/carma/include/carma/carma/cnumpy.h:82]
~~Dr.M~~ # 6 carma::steal_copy_array<>                          [C:/Users/User/carma/include/carma/carma/cnumpy.h:99]
~~Dr.M~~ # 7 carma::arr_to_mat<>                                [C:/Users/User/carma/include/carma/carma/converters.h:112]

and in the VS env:

(arr ptr = 0000022224FD75A0)
~~Dr.M~~ Error #516: UNADDRESSABLE ACCESS beyond heap bounds: reading 0x0000022224fd759c-0x0000022224fd759d 1 byte(s)
~~Dr.M~~ # 0 ucrtbased.dll!calloc_base                                     +0x51c    (0x00007ffaed960e9c <ucrtbased.dll+0x50e9c>)
~~Dr.M~~ # 1 ucrtbased.dll!aligned_free_dbg                                +0x40     (0x00007ffaed963b51 <ucrtbased.dll+0x53b51>)
~~Dr.M~~ # 2 ucrtbased.dll!aligned_free                                    +0x12     (0x00007ffaed960713 <ucrtbased.dll+0x50713>)


hpwxf commented on August 17, 2024

I managed to get a very small reproducer without carma and armadillo.

@RUrlus where did you find documentation about the underlying function behind the following code?

// extracted from numpyai.h
void **api_ptr = reinterpret_cast<void **>(PyCapsule_GetPointer(c.ptr(), NULL));
api.Func##_ = (decltype(api.Func##_)) api_ptr[API_##Func];

with Func == PyArray_Free and API_PyArray_Free == 165


hpwxf commented on August 17, 2024

Is it here? PyArray_Free in https://numpy.org/doc/stable/reference/c-api/array.html
And what about the magic number 165?


hpwxf commented on August 17, 2024

OK, I got it: numpy/__multiarray_api.h


RUrlus commented on August 17, 2024

I followed pybind11's approach in numpy.h.

As far as I could see, malloc is used by Numpy, Python and Armadillo; for Armadillo, at least, as long as none of the other backends are used.


RUrlus commented on August 17, 2024

I tried using Python's free rather than the aligned free, but it doesn't change the outcome.


hpwxf commented on August 17, 2024

I'm not sure, but I think I've got it (not yet a solution).
I dove into the numpy and cpython source code and remembered that on Windows, allocation and deallocation must happen in the same module. In our case the memory is allocated in a Python DLL module and we try to deallocate it in the main binary. Even if we use the same allocator/deallocator pair (malloc/free, new/delete, _aligned_malloc/_aligned_free, ...), these can be different allocator instances, and in that case it fails.

Refs:

To fix this (or as a proof of concept), the PyArray_NewCopy_ function should be rewritten to use the same allocator instance as Armadillo.


hpwxf commented on August 17, 2024

I just tested, in my tiny reproducer, using PyArray_free (not PyArray_Free), and it seems to work.

PS: It is not suitable to use this inside armadillo, but it supports my hypothesis.

  • PyArray_free (numpy)
    • is a define of PyMem_RawFree (numpy)
      • is a redirection to _PyMem_RawFree (cpython)
        • is the trivial static void _PyMem_RawFree(void *ctx, void *ptr) { free(ptr); } (cpython)


hpwxf commented on August 17, 2024

My reproducer:

#include <carma/carma.h>
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
#include <armadillo>
#include <catch2/catch.hpp>

namespace py = pybind11;

typedef py::array_t<double, py::array::f_style | py::array::forcecast> fArr;

TEST_CASE("Test arr_to_mat", "[arr_to_mat]") {
    py::module np = py::module::import("numpy");
    py::module np_rand = py::module::import("numpy.random");

    SECTION("C-contiguous; no copy; no strict") {
        bool copy = false;
        bool strict = false;

        std::cerr << "PYALLOC BEGIN" << std::endl;
        py::array_t<double> arr = np_rand.attr("normal")(2, 0, py::make_tuple(100, 2));
        // py::array_t<double> arr = np.attr("zeros")(py::make_tuple(100, 2));
        std::cerr << "PYALLOC END" << std::endl;

        // attributes of the numpy array
        size_t arr_N = arr.size();
        size_t arr_S0 = arr.shape(0);
        size_t arr_S1 = arr.shape(1);
        auto arr_p = arr.unchecked<2>();

        // get buffer for raw pointer
        py::buffer_info info = arr.request();

        // call function to be tested
        {
            std::cout << "BEGIN" << std::endl;
            // arma::Mat<double> M = carma::arr_to_mat<double>(arr, copy, strict);
            assert(carma::is_c_contiguous_2d(arr) || carma::requires_copy(arr) || copy);
            // copy and ensure fortran order
            // data = carma::steal_copy_array<double>(arr.ptr());
            PyObject* psrc = arr.ptr();
            py::module m = py::module::import("numpy.core.multiarray");
            auto c = m.attr("_ARRAY_API");
            void** api_ptr = reinterpret_cast<void**>(PyCapsule_GetPointer(c.ptr(), NULL));
            PyObject* (*PyArray_NewCopy_)(PyObject*, int);
            PyArray_NewCopy_ = (decltype(PyArray_NewCopy_)) api_ptr[85];
            void (*PyArray_Free_)(PyObject*, void* ptr);
            PyArray_Free_ = (decltype(PyArray_Free_)) api_ptr[165];

            // copy the array to a well-behaved F-order array
            PyObject* dest = PyArray_NewCopy_(psrc, NPY_FORTRANORDER);
            // we steal the memory
            PyArrayObject_fields* parr = reinterpret_cast<PyArrayObject_fields*>(dest);
            double* data = reinterpret_cast<double*>(parr->data);
            std::cout << "Raw data: " << data << std::endl;
#if 1       // original PyObject free
            parr->data = nullptr;
            // free the array object (parr->data is now nullptr, so the buffer survives)
            PyArray_Free_(dest, static_cast<void*>(parr->data));
#endif
#if 1       // arma::Mat build (not required, but similar to the carma code)
            // return p_arr_to_mat(info, data, true, strict);
            bool stolen = true;
            assert(info.ndim == 2);
            arma::uword nrows = info.shape[0];
            arma::uword ncols = info.shape[1];
            arma::uword nelem = info.size;
            bool copy = (nelem > arma::arma_config::mat_prealloc) ? false : true;
            assert(copy == false);
            std::cout << data << " " << stolen << " " << strict << " " << nelem << " " << copy << std::endl;
            arma::Mat<double> M(data, nrows, ncols, false, strict);
            // after stealing, Arma has to manage the lifetime of the memory
            std::cout << "M.mem_state = " << M.mem_state << std::endl;
            // uncomment the following lines to reproduce carma dev/0.4.0 behavior
            // arma::access::rw(M.n_alloc) = nelem;
            // arma::access::rw(M.mem_state) = 0;

            double mat_sum = arma::accu(M);

            // variables for test status
            CHECK(arr_N == M.n_elem);
            CHECK(arr_S0 == M.n_rows);
            CHECK(arr_S1 == M.n_cols);
            CHECK(info.ptr != M.memptr());
#endif
            std::cout << "END" << std::endl;
            // the error should occur after here

            // arma::memory::release(data);  // if data deallocation is delegated to armadillo
            // PyArray_Free_(dest, static_cast<void*>(data));  // late PyObject free
            PyArray_free(data);  // call the deallocator in the same module as the allocation
            // free(data);  // same free as PyArray_free, but not in the same module
        }

        // the error should occur before this point
        std::cout << "AFTER" << std::endl;
    }

} /* TEST_CASE ARR_TO_ROW */


hpwxf commented on August 17, 2024

I tried a "not perfect fix" in the branch https://github.com/hpwxf/carma/tree/dev/0.4.0; it seems to be better, cf. https://travis-ci.com/github/hpwxf/carma/jobs/472516980 .

There remains only one failing case: "Test arr_to_cube" ("Test arr_to_mat" is solved).


RUrlus commented on August 17, 2024

@hpwxf Thanks for all the work and for finding the issue! I was not aware of the (de-)allocation requirement on Windows.

A clean fix could be to use the Numpy (de-)allocator inside armadillo, considering it calls malloc anyway and, I believe, always yields aligned blocks of memory. I'll try tonight.

If it works, we could ask the Armadillo devs whether they would be willing to include a compile-time definition for this.
A less nice solution is to ship a fork of armadillo with CARMA.

I'm not keen on having to potentially copy twice.


hpwxf commented on August 17, 2024

Armadillo uses a statically defined deallocator. Even if you can convince armadillo to add PyArray_free, it cannot handle, with the same code, both data coming from Python and data coming from C++ (PyArray_free cannot deallocate data allocated in the C++ world). If armadillo is to change, the best idea seems to be to add a reference to a deallocator functor (but this would break the API).

For carma, my "not perfect fix" was meant to demonstrate that if data are allocated on the C++ side, armadillo can manage them.
To do this in a proper way, I suggest replacing PyArray_NewCopy with something like PyArray_New, PyArray_SimpleNewFromData or PyArray_SimpleNewFromDescr (cf. https://numpy.org/doc/stable/reference/c-api/array.html). That way, you can allocate the data buffer by hand (the C++ way) and use it in Python, so that Python will fill it.

If this is not clear, I can try to update my fix demo using such a technique (I don't know yet when I could do this; maybe at the end of the week or next week).


hpwxf commented on August 17, 2024

NB: PyArray_NewFromDescr_int seems to be the underlying method for all of this, cf. https://github.com/numpy/numpy/blob/2908338b19c26c043eab61c2af7bdff96b02b1bc/numpy/core/src/multiarray/ctors.c#L842


RUrlus commented on August 17, 2024

I get your solution; I haven't had time to implement it yet. The downside is that you can't steal the memory on Windows using this approach. You can still borrow it, but you would need Armadillo to deallocate with PyArray_free in order to steal. This in turn would mean that it would also have to allocate using Numpy's malloc, as we can't feasibly track which arma objects operate on stolen memory.


hpwxf commented on August 17, 2024

I'm going to try to do it for fun.


hpwxf commented on August 17, 2024

I know that my suggested solution is not perfect and that an additional copy will be required in some cases on Windows.
Nevertheless, I have tried to code this idea, but it is not yet functional and I have an issue that I don't understand (even after reading the numpy source code).

The following code sample runs well if data is null. However, if I allocate data beforehand, at the end the data buffer contains the values of the src PyArray but with transposed data (though not a transposed shape).

    PyObject* dest = api.PyArray_NewFromDescr_(
        subok ? Py_TYPE(src) : &PyArray_Type,
        dtype,
        ndim,
        dims,
        strides,
        data,
        NPY_FORTRANORDER | ((data) ? ~NPY_ARRAY_OWNDATA : 0),
        subok ? (PyObject*)src : NULL);

    api.PyArray_CopyInto_((PyArrayObject*)dest, (PyArrayObject*)src);

For example

import numpy as np
import test_carma as carma
sample = np.arange(20).reshape(10,2)
mat = carma.tc_out_mat(sample)
print(sample+1)
print(mat)

prints (the arrays should have the same shape and values):

[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]
 [17 18]
 [19 20]]
[[ 1. 11.]
 [ 2. 12.]
 [ 3. 13.]
 [ 4. 14.]
 [ 5. 15.]
 [ 6. 16.]
 [ 7. 17.]
 [ 8. 18.]
 [ 9. 19.]
 [10. 20.]]

Do you have any idea?


RUrlus commented on August 17, 2024

This is an issue related to going from C-order (row-contiguous) to Fortran-order (column-contiguous). PyArray_CopyInto seems to be a straight memcpy:

int _PyArray_CopyInto(PyArrayObject* dest, PyArrayObject* src)
{
    memcpy(PyArray_DATA(dest), PyArray_DATA(src), PyArray_NBYTES(dest));
    return 0;
}

The only reference I could find was this old thread, which wasn't very helpful.
Have you checked the strides and order of the returned array with, respectively:

import numpy as np
import test_carma as carma
sample = np.arange(20).reshape(10,2)
mat = carma.tc_out_mat(sample)
print(mat.__array_interface__)
print(mat.flags)

It could be we're seeing a mismatch between the order and strides.


RUrlus commented on August 17, 2024

I think the issue is with the strides: based on the latest commit in your branch, strides is NULL, which Numpy interprets as C-order. So Numpy acts as if the memory is row-contiguous even though the flag is set to Fortran order.


hpwxf commented on August 17, 2024

OK, thank you; I will try something with your info.
NB, in my test case:

  • input data (src) is NPY_ARRAY_C_CONTIGUOUS
  • when data is null (and strides is null):
    • it internally creates a new data buffer (new address); the result is perfect (right ordering) even if I use the same PyArray_CopyInto
    • the result is good only if order is NPY_FORTRANORDER or NPY_KEEPORDER (but not with NPY_CORDER, even though the input data is NPY_ARRAY_C_CONTIGUOUS)
  • when data is not null (allocated by hand beforehand):
    • whatever the order is (NPY_FORTRANORDER, NPY_CORDER or NPY_KEEPORDER), the result is the same (bad) result

The only difference is how/where data has been allocated (inside Numpy or by hand). I assume I am missing some flags.


hpwxf commented on August 17, 2024

NB: in numpy 1.9.x, my version of PyArray_CopyInto is

NPY_NO_EXPORT int
PyArray_CopyInto(PyArrayObject *dst, PyArrayObject *src)
{
    return PyArray_AssignArray(dst, src, NULL, NPY_UNSAFE_CASTING);
}

and PyArray_AssignArray is a rather long function without any memcpy.


hpwxf commented on August 17, 2024

In any case, I will make a standalone reproducer and submit it to the numpy specialists.


RUrlus commented on August 17, 2024

  • the result is good only if order is NPY_FORTRANORDER or NPY_KEEPORDER (but not with NPY_CORDER even if input data is NPY_ARRAY_C_CONTIGUOUS)

This is because, on the conversion out, we assume Fortran order when setting the strides.

  • when data is not null (allocated before by hand); whatever order is (NPY_FORTRANORDER, NPY_CORDER or NPY_KEEPORDER) the result is the (bad) result.

So the only thing I can think of is that if the data is already allocated and strides is null, numpy assumes it is C-order regardless of the flag set (e.g. NPY_FORTRANORDER).


RUrlus commented on August 17, 2024

    npy_intp* strides = reinterpret_cast<PyArrayObject_fields *>(src)->strides;
    npy_intp tmp = strides[0];
    strides[0] = strides[1];
    strides[1] = tmp;

should eliminate the stride issue when strides is NULL.


hpwxf commented on August 17, 2024

Now it is almost OK. Exactly as expected: no extra copy, even on Windows.


hpwxf commented on August 17, 2024

By the way, I found another bug, with Visual Studio (not gcc on Windows), in this test (Test arr_to_cube: F-contiguous; no copy; strict).
To get CHECK(info.ptr == M.memptr()), as you expect, you need to avoid any copy of the Cube object; since strict == true, if any copy occurs you get a data copy.
Unfortunately, Visual Studio 16 (2019) does not apply the NRVO optimization for return dest; in p_arr_to_cube.

Starting from p_arr_to_cube, you can see this in the armadillo tracing:

DATA PTR = 000001842056D120 # data ptr from numpy object before building a Cube
@ __cdecl arma::Cube<double>::Cube(double *,const unsigned __int64,const unsigned __int64,const unsigned __int64,const bool,const bool,const bool) [this = 00000003618FEB40]
@ void __cdecl arma::Cube<double>::create_mat(void)
DATA PTR = 000001842056D120 # data ptr from numpy object after building a Cube
@ __cdecl arma::Cube<double>::Cube(class arma::Cube<double> &&) [this = 00000003618FF0A0]
@ __cdecl arma::Cube<double>::Cube(class arma::Cube<double> &&) [this = 618ff0a0   in_cube = 618feb40]
@ void __cdecl arma::Cube<double>::steal_mem(class arma::Cube<double> &)
@ class arma::Cube<double> &__cdecl arma::Cube<double>::operator =(const class arma::Cube<double> &) [this = 618ff0a0   in_cube = 618feb40]
@ void __cdecl arma::Cube<double>::init_warm(const unsigned __int64,const unsigned __int64,const unsigned __int64) [in_n_rows = 100, in_n_cols = 2, in_n_slices = 2]
@ void __cdecl arma::Cube<double>::delete_mat(void)
Cube::init(): acquiring memory
@ void __cdecl arma::Cube<double>::create_mat(void)
@ __cdecl arma::Cube<double>::~Cube(void) [this = 00000003618FEB40]
@ void __cdecl arma::Cube<double>::delete_mat(void)
DATA PTR = 0000018420569280 # data ptr of built Cube after the call of carma::arr_to_cube
@ double __cdecl arma::accu<class arma::Cube<double>>(const struct arma::BaseCube<double,class arma::Cube<double> > &)
@ __cdecl arma::ProxyCube<class arma::Cube<double> >::ProxyCube(const class arma::Cube<double> &)
@ __cdecl arma::unwrap_cube<class arma::Cube<double> >::unwrap_cube(const class arma::Cube<double> &)

That's why I get this failure:

C:\Users\User\carma\tests\src\test_arr_to_mat_native.cpp(287): FAILED:
  CHECK( info.ptr == M.memptr() )
with expansion:
  0x000001842056d120 == 0x0000018420569280


hpwxf commented on August 17, 2024

I have to clean up the code, and then this issue will be solved.


hpwxf commented on August 17, 2024

PR ready.


RUrlus commented on August 17, 2024

Closed by #72

