
 RMM: RAPIDS Memory Manager

NOTE: For the latest stable README.md ensure you are on the main branch.


Overview

Achieving optimal performance in GPU-centric workflows frequently requires customizing how host and device memory are allocated. For example, using "pinned" host memory for asynchronous host <-> device memory transfers, or using a device memory pool sub-allocator to reduce the cost of dynamic device memory allocation.

The goal of the RAPIDS Memory Manager (RMM) is to provide:

  • A common interface that allows customizing device and host memory allocation
  • A collection of implementations of the interface
  • A collection of data structures that use the interface for memory allocation

For information on the interface RMM provides and how to use RMM in your C++ code, see below.

For a walkthrough about the design of the RAPIDS Memory Manager, read Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager on the NVIDIA Developer Blog.

Installation

Conda

RMM can be installed with Conda (miniconda, or the full Anaconda distribution) from the rapidsai channel:

conda install -c rapidsai -c conda-forge -c nvidia rmm cuda-version=12.0

We also provide nightly Conda packages built from the HEAD of our latest development branch.

Note: RMM is supported only on Linux, and only tested with Python versions 3.10 and 3.11.

Note: The RMM package from Conda requires building with GCC 9 or later. Otherwise, your application may fail to build.

See the Get RAPIDS version picker for more OS and version info.

Building from Source

Get RMM Dependencies

Compiler requirements:

  • gcc version 9.3+
  • nvcc version 11.4+
  • cmake version 3.26.4+

CUDA/GPU requirements:

GPU Support:

  • RMM is tested and supported only on Volta architecture and newer (Compute Capability 7.0+). It may work on earlier architectures.

Python requirements:

  • rapids-build-backend (available from PyPI or the rapidsai conda channel)
  • scikit-build-core
  • cuda-python
  • cython

For more details, see pyproject.toml

Script to build RMM from source

To install RMM from source, ensure the dependencies are met and follow the steps below:

  • Clone the repository and submodules
$ git clone --recurse-submodules https://github.com/rapidsai/rmm.git
$ cd rmm
  • Create the conda development environment rmm_dev
# create the conda environment (assuming in base `rmm` directory)
$ conda env create --name rmm_dev --file conda/environments/all_cuda-118_arch-x86_64.yaml
# activate the environment
$ conda activate rmm_dev
  • Build and install librmm using cmake & make. CMake requires the nvcc executable to be on your PATH or defined in the CUDACXX environment variable.
$ mkdir build                                       # make a build directory
$ cd build                                          # enter the build directory
$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path     # configure cmake ... use $CONDA_PREFIX if you're using Anaconda
$ make -j                                           # compile the library librmm.so ... '-j' will start a parallel job using the number of physical cores available on your system
$ make install                                      # install the library librmm.so to '/install/path'
  • Alternatively, build and install librmm and rmm using build.sh. build.sh creates the build directory at the root of the git repository and, like the CMake build above, requires the nvcc executable to be on your PATH or defined in the CUDACXX environment variable.
$ ./build.sh -h                                     # Display help and exit
$ ./build.sh -n librmm                              # Build librmm without installing
$ ./build.sh -n rmm                                 # Build rmm without installing
$ ./build.sh -n librmm rmm                          # Build librmm and rmm without installing
$ ./build.sh librmm rmm                             # Build and install librmm and rmm
  • To run tests (Optional):
$ cd build                                          # if you are not already in the build directory
$ make test
  • Build, install, and test the rmm Python package, from the root rmm directory:
# In the root rmm directory
$ python -m pip install -e ./python/rmm
$ pytest -v

Done! You are ready to develop for the RMM OSS project.

Caching third-party dependencies

RMM uses CPM.cmake to handle third-party dependencies such as spdlog, Thrust, GoogleTest, and GoogleBenchmark. In general you won't have to worry about it. If CMake finds an appropriate version on your system, it uses it (you can help it along by setting CMAKE_PREFIX_PATH to point to the installed location). Otherwise those dependencies will be downloaded as part of the build.

If you frequently start new builds from scratch, consider setting the environment variable CPM_SOURCE_CACHE to an external download directory to avoid repeated downloads of the third-party dependencies.
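
For example (the cache directory below is arbitrary; any persistent path works):

$ export CPM_SOURCE_CACHE=$HOME/.cache/CPM          # downloaded sources are reused across clean builds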

Using RMM in a downstream CMake project

The installed RMM library provides a set of config files that makes it easy to integrate RMM into your own CMake project. In your CMakeLists.txt, just add

find_package(rmm [VERSION])
# ...
target_link_libraries(<your-target> (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)

Since RMM is a header-only library, this does not actually link RMM, but it makes the headers available and pulls in transitive dependencies. If RMM is not installed in a default location, use CMAKE_PREFIX_PATH or rmm_ROOT to point to its location.

One of RMM's dependencies is the Thrust library, so the above automatically pulls in Thrust by means of a dependency on the rmm::Thrust target. By default it uses the standard configuration of Thrust. If you want to customize it, you can set the variables THRUST_HOST_SYSTEM and THRUST_DEVICE_SYSTEM; see Thrust's CMake documentation.
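
For example, a minimal CMake sketch that overrides the Thrust host system before RMM (and hence Thrust) is found; my_target is a placeholder, and the accepted values are listed in Thrust's CMake documentation:

set(THRUST_HOST_SYSTEM TBB)   # e.g. CPP (the default), OMP, or TBB

find_package(rmm REQUIRED)
target_link_libraries(my_target PRIVATE rmm::rmm)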

Using CPM to manage RMM

RMM uses CPM.cmake to manage its dependencies, including CCCL, and you can use CPM for your project's dependency on RMM.

There is an issue with using CPM's single-argument compact syntax for RMM/CCCL as it transitively marks targets as SYSTEM dependencies. This causes the CCCL headers pulled in through CPM to be of lower priority to the preprocessor than the (potentially outdated) CCCL headers provided by the CUDA SDK. To avoid this issue, use CPM's multi-argument syntax instead:

CPMAddPackage(NAME rmm [VERSION]
              GITHUB_REPOSITORY rapidsai/rmm
              SYSTEM Off)
# ...
target_link_libraries(<your-target> (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)

Using RMM in C++

The first goal of RMM is to provide a common interface for device and host memory allocation. This allows both users and implementers of custom allocation logic to program to a single interface.

To this end, RMM defines two abstract interface classes:

  • rmm::mr::device_memory_resource, for device memory allocation
  • rmm::mr::host_memory_resource, for host memory allocation

These classes are based on the std::pmr::memory_resource interface class introduced in C++17 for polymorphic memory allocation.

device_memory_resource

rmm::mr::device_memory_resource is the base class that defines the interface for allocating and freeing device memory.

It has two key functions:

  1. void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)

    • Returns a pointer to an allocation of at least bytes bytes.
  2. void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)

    • Reclaims a previous allocation of size bytes pointed to by p.
    • p must have been returned by a previous call to allocate(bytes), otherwise behavior is undefined

It is up to a derived class to provide implementations of these functions. See available resources for example device_memory_resource derived classes.

Unlike std::pmr::memory_resource, rmm::mr::device_memory_resource does not allow specifying an alignment argument. All allocations are required to be aligned to at least 256B. Furthermore, device_memory_resource adds an additional cuda_stream_view argument to allow specifying the stream on which to perform the (de)allocation.
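
A minimal sketch of calling these functions directly on a concrete resource (in practice you would usually use a container such as rmm::device_buffer instead):

rmm::mr::cuda_memory_resource mr;
rmm::cuda_stream_view stream = rmm::cuda_stream_default;
// Allocate at least 100 bytes (aligned to at least 256B) for use on `stream`
void* p = mr.allocate(100, stream);
// ... use `p` on `stream` ...
// Deallocate with the same size; `stream` must be a stream on which the memory may be reused immediately
mr.deallocate(p, 100, stream);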

Stream-ordered Memory Allocation

rmm::mr::device_memory_resource is a base class that provides stream-ordered memory allocation. This allows optimizations such as re-using memory deallocated on the same stream without the overhead of synchronization.

A call to device_memory_resource::allocate(bytes, stream_a) returns a pointer that is valid to use on stream_a. Using the memory on a different stream (say stream_b) is Undefined Behavior unless the two streams are first synchronized, for example by using cudaStreamSynchronize(stream_a) or by recording a CUDA event on stream_a and then calling cudaStreamWaitEvent(stream_b, event).

The stream specified to device_memory_resource::deallocate should be a stream on which it is valid to use the deallocated memory immediately for another allocation. Typically this is the stream on which the allocation was last used before the call to deallocate. The passed stream may be used internally by a device_memory_resource for managing available memory with minimal synchronization, and it may also be synchronized at a later time, for example using a call to cudaStreamSynchronize().

For this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to device_memory_resource::deallocate. If the stream on which the allocation was last used has been destroyed before calling deallocate or it is known that it will be destroyed, it is likely better to synchronize the stream (before destroying it) and then pass a different stream to deallocate (e.g. the default stream).
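
A sketch of the event-based synchronization described above; here stream_a and stream_b are rmm::cuda_stream_views, mr is some device_memory_resource, and kernel_a, kernel_b, grid, and block are placeholders (error checking omitted):

void* p = mr.allocate(100, stream_a);               // `p` is valid to use on `stream_a`
kernel_a<<<grid, block, 0, stream_a.value()>>>(p);

cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event, stream_a.value());
cudaStreamWaitEvent(stream_b.value(), event, 0);    // `stream_b` now waits for work on `stream_a`
kernel_b<<<grid, block, 0, stream_b.value()>>>(p);  // safe: the streams are synchronized

mr.deallocate(p, 100, stream_b);                    // `stream_b` is where `p` was last used
cudaEventDestroy(event);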

Note that device memory data structures such as rmm::device_buffer and rmm::device_uvector follow these stream-ordered memory allocation semantics and rules.

For further information about stream-ordered memory allocation semantics, read Using the NVIDIA CUDA Stream-Ordered Memory Allocator on the NVIDIA Developer Blog.

Available Device Resources

RMM provides several device_memory_resource derived classes to satisfy various user requirements. For more detailed information about these resources, see their respective documentation.

cuda_memory_resource

Allocates and frees device memory using cudaMalloc and cudaFree.

managed_memory_resource

Allocates and frees device memory using cudaMallocManaged and cudaFree.

Note that managed_memory_resource cannot be used with NVIDIA Virtual GPU Software (vGPU, for use with virtual machines or hypervisors) because NVIDIA CUDA Unified Memory is not supported by NVIDIA vGPU.

pool_memory_resource

A coalescing, best-fit pool sub-allocator.

fixed_size_memory_resource

A memory resource that can only allocate a single fixed size. Average allocation and deallocation cost is constant.

binning_memory_resource

Configurable to use multiple upstream memory resources for allocations that fall within different bin sizes. Often configured with multiple bins backed by fixed_size_memory_resources and a single pool_memory_resource for allocations larger than the largest bin size.
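
A minimal sketch of such a configuration, assuming binning_memory_resource's upstream-pointer constructor and its add_bin(size) member (which creates a fixed_size_memory_resource bin when no explicit bin resource is supplied); the bin sizes are arbitrary:

using pool_mr_t = rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>;
rmm::mr::cuda_memory_resource cuda_mr;
pool_mr_t pool_mr{&cuda_mr, rmm::percent_of_free_device_memory(50)};

// Bins of 256B, 512B, 1KiB and 2KiB; allocations larger than 2KiB fall through to `pool_mr`
rmm::mr::binning_memory_resource<pool_mr_t> binning_mr{&pool_mr};
binning_mr.add_bin(256);
binning_mr.add_bin(512);
binning_mr.add_bin(1024);
binning_mr.add_bin(2048);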

Default Resources and Per-device Resources

RMM users commonly need to configure a device_memory_resource object to use for all allocations where another resource has not explicitly been provided. A common example is configuring a pool_memory_resource to use for all allocations to get fast dynamic allocation.

To enable this use case, RMM provides the concept of a "default" device_memory_resource. This resource is used when another is not explicitly provided.

Accessing and modifying the default resource is done through two functions:

  • device_memory_resource* get_current_device_resource()

    • Returns a pointer to the default resource for the current CUDA device.
    • The initial default memory resource is an instance of cuda_memory_resource.
    • This function is thread safe with respect to concurrent calls to it and set_current_device_resource().
    • For more explicit control, you can use get_per_device_resource(), which takes a device ID.
  • device_memory_resource* set_current_device_resource(device_memory_resource* new_mr)

    • Updates the default memory resource pointer for the current CUDA device to new_mr
    • Returns the previous default resource pointer
    • If new_mr is nullptr, then resets the default resource to cuda_memory_resource
    • This function is thread safe with respect to concurrent calls to it and get_current_device_resource()
    • For more explicit control, you can use set_per_device_resource(), which takes a device ID.

Example

rmm::mr::cuda_memory_resource cuda_mr;
// Construct a resource that uses a coalescing best-fit pool allocator
// With the pool initially half of available device memory
auto initial_size = rmm::percent_of_free_device_memory(50);
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr, initial_size};
rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device resource pointer to `pool_mr`
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr`

Multiple Devices

A device_memory_resource should only be used when the active CUDA device is the same device that was active when the device_memory_resource was created. Otherwise behavior is undefined.

If a device_memory_resource is used with a stream associated with a different CUDA device than the device for which the memory resource was created, behavior is undefined.

Creating a device_memory_resource for each device requires care to set the current device before creating each resource, and to maintain the lifetime of the resources as long as they are set as per-device resources. Here is an example loop that creates unique_ptrs to pool_memory_resource objects for each device and sets them as the per-device resource for that device.

using pool_mr = rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource>;
std::vector<std::unique_ptr<pool_mr>> per_device_pools;
for (int i = 0; i < N; ++i) {
  cudaSetDevice(i); // set device i before creating MR
  // Use a vector of unique_ptr to maintain the lifetime of the MRs
  // Note: for brevity, omitting creation of upstream and computing initial_size
  per_device_pools.push_back(std::make_unique<pool_mr>(upstream, initial_size));
  // Set the per-device resource for device i; pass the raw resource pointer
  set_per_device_resource(cuda_device_id{i}, per_device_pools.back().get());
}

Note that the CUDA device that is current when creating a device_memory_resource must also be current any time that device_memory_resource is used to deallocate memory, including in a destructor. The RAII class rmm::device_buffer and classes that use it as a backing store (rmm::device_scalar and rmm::device_uvector) handle this by storing the active device when the constructor is called, and then ensuring that the stored device is active whenever an allocation or deallocation is performed (including in the destructor). The user must therefore only ensure that the device active during creation of an rmm::device_buffer matches the active device of the memory resource being used.

Here is an incorrect example that creates a memory resource on device zero and then uses it to allocate a device_buffer on device one:

{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  {
    RMM_CUDA_TRY(cudaSetDevice(1));
    // Invalid, current device is 1, but MR is only valid for device 0
    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  }
}

A correct example creates the device buffer with device zero active. After that it is safe to switch devices and let the buffer go out of scope and destruct with a different device active. For example, this code is correct:

{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~buf runs
}

Use of rmm::device_vector with multiple devices

rmm::device_vector uses an rmm::mr::thrust_allocator to enable thrust::device_vector to allocate and deallocate memory using RMM. As such, the usual rules for usage of the backing memory resource apply: the active device must match the active device at resource construction time. To facilitate use in an RAII setting, rmm::mr::thrust_allocator records the active device at construction time and ensures that device is active whenever it allocates or deallocates memory. Usage of rmm::device_vector with multiple devices is therefore the same as rmm::device_buffer. One must create device_vectors with the correct device active, but it is safe to destroy them with a different active device.

For example, recapitulating the previous example using rmm::device_vector:

{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~vec runs
}

Note

Although allocation and deallocation in the thrust_allocator run with the correct active device, modification of rmm::device_vector might necessitate a kernel launch, and this must run with the correct device active. For example, .resize() might both allocate and launch a kernel to initialize new elements: the user must arrange for this kernel launch to occur with the correct device for the memory resource active.

cuda_stream_view and cuda_stream

rmm::cuda_stream_view is a simple non-owning wrapper around a CUDA cudaStream_t. This wrapper's purpose is to provide strong type safety for stream types. (cudaStream_t is an alias for a pointer, which can lead to ambiguity in APIs when it is assigned 0.) All RMM stream-ordered APIs take a rmm::cuda_stream_view argument.

rmm::cuda_stream is a simple owning wrapper around a CUDA cudaStream_t. This class provides RAII semantics (constructor creates the CUDA stream, destructor destroys it). An rmm::cuda_stream can never represent the CUDA default stream or per-thread default stream; it only ever represents a single non-default stream. rmm::cuda_stream cannot be copied, but can be moved.
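
A brief sketch of creating an owning stream and passing a view of it to stream-accepting APIs:

rmm::cuda_stream stream;                     // creates a new non-default CUDA stream (RAII)
rmm::cuda_stream_view view = stream.view();  // non-owning view of the same stream
rmm::device_buffer buf(100, view);           // stream-ordered allocation on that stream
// stream.value() returns the underlying cudaStream_t, e.g. for kernel launches.
// The CUDA stream is destroyed when `stream` goes out of scope.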

cuda_stream_pool

rmm::cuda_stream_pool provides fast access to a pool of CUDA streams. This class can be used to create a set of cuda_stream objects whose lifetime is equal to the cuda_stream_pool. Using the stream pool can be faster than creating the streams on the fly. The size of the pool is configurable. Depending on this size, multiple calls to cuda_stream_pool::get_stream() may return instances of rmm::cuda_stream_view that represent identical CUDA streams.
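
A small sketch (the pool size of 16 is arbitrary):

rmm::cuda_stream_pool pool(16);               // creates 16 streams up front
rmm::cuda_stream_view s = pool.get_stream();  // cheap: returns a view of a pooled stream
rmm::device_uvector<int> v(100, s);           // stream-ordered allocation on the pooled stream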

Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically, calls to memory resource allocate() and deallocate() methods are safe with respect to calls to either of these functions from other threads. They are not thread safe with respect to construction and destruction of the memory resource object.

Note that a class thread_safe_resource_adaptor is provided which can be used to adapt a memory resource that is not thread safe to be thread safe (as described above). This adaptor is not needed with any current RMM device memory resources.

Allocators

C++ interfaces commonly allow customizable memory allocation through an Allocator object. RMM provides several Allocator and Allocator-like classes.

polymorphic_allocator

A stream-ordered allocator similar to std::pmr::polymorphic_allocator. Unlike the standard C++ Allocator interface, the allocate and deallocate functions take a cuda_stream_view indicating the stream on which the (de)allocation occurs.
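
A minimal sketch of its stream-ordered allocate/deallocate calls (the default-constructed allocator uses the current device resource):

rmm::cuda_stream stream;
rmm::mr::polymorphic_allocator<int> alloc;    // uses the current device resource
int* p = alloc.allocate(100, stream.view());  // space for 100 ints, allocated on `stream`
// ... use `p` on `stream` ...
alloc.deallocate(p, 100, stream.view());      // deallocated on `stream`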

stream_allocator_adaptor

stream_allocator_adaptor can be used to adapt a stream-ordered allocator to present a standard Allocator interface to consumers that may not be designed to work with a stream-ordered interface.

Example:

rmm::cuda_stream stream;
rmm::mr::polymorphic_allocator<int> stream_alloc;

// Constructs an adaptor that forwards all (de)allocations to `stream_alloc` on `stream`.
auto adapted = rmm::mr::stream_allocator_adaptor(stream_alloc, stream);

// Allocates 100 bytes using `stream_alloc` on `stream`
auto p = adapted.allocate(100);
...
// Deallocates using `stream_alloc` on `stream`
adapted.deallocate(p,100);

thrust_allocator

thrust_allocator is a device memory allocator that uses the strongly typed thrust::device_ptr, making it usable with containers like thrust::device_vector.
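
For example, rmm::device_vector is an alias for a thrust::device_vector that uses thrust_allocator, and the allocator can also be constructed explicitly with a stream and memory resource (as in the multi-device example above):

rmm::device_vector<int> vec(100);  // storage comes from the current device resource

rmm::mr::cuda_memory_resource mr;
rmm::device_vector<int> vec2(100, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));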

See below for more information on using RMM with Thrust.

Device Data Structures

device_buffer

An untyped, uninitialized RAII class for stream ordered device memory allocation.

Example

cuda_stream_view s{...};
// Allocates at least 100 bytes on stream `s` using the *default* resource
rmm::device_buffer b{100,s};
void* p = b.data();                   // Raw, untyped pointer to underlying device memory

kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`

rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
// Allocates at least 100 bytes on stream `s` using the resource `mr`
rmm::device_buffer b2{100, s, mr};

device_uvector<T>

A typed, uninitialized RAII class for allocation of a contiguous set of elements in device memory. Similar to a thrust::device_vector, but as an optimization, does not default initialize the contained elements. This optimization restricts the types T to trivially copyable types.

Example

cuda_stream_view s{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the
// default resource
rmm::device_uvector<int32_t> v(100, s);
// Initializes the elements to 0
thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});

rmm::mr::device_memory_resource * mr = new my_custom_resource{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`
rmm::device_uvector<int32_t> v2{100, s, mr};

device_scalar

A typed, RAII class for allocation of a single element in device memory. This is similar to a device_uvector with a single element, but provides convenience functions like modifying the value in device memory from the host, or retrieving the value from device to host.

Example

cuda_stream_view s{...};
// Allocates uninitialized storage for a single `int32_t` in device memory
rmm::device_scalar<int32_t> a{s};
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`

kernel<<<...,s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory

int32_t v = a.value(s); // Retrieves the value from device to host on stream `s`

host_memory_resource

rmm::mr::host_memory_resource is the base class that defines the interface for allocating and freeing host memory.

Similar to device_memory_resource, it has two key functions for (de)allocation:

  1. void* host_memory_resource::allocate(std::size_t bytes, std::size_t alignment)

    • Returns a pointer to an allocation of at least bytes bytes aligned to the specified alignment
  2. void host_memory_resource::deallocate(void* p, std::size_t bytes, std::size_t alignment)

    • Reclaims a previous allocation of size bytes pointed to by p.

Unlike device_memory_resource, the host_memory_resource interface and behavior are identical to std::pmr::memory_resource.

Available Host Resources

new_delete_resource

Uses the global operator new and operator delete to allocate host memory.

pinned_memory_resource

Allocates "pinned" host memory using cuda(Malloc/Free)Host.

Host Data Structures

RMM does not currently provide any data structures that interface with host_memory_resource. In the future, RMM will provide a similar host-side structure like device_buffer and an allocator that can be used with STL containers.

Using RMM with Thrust

RAPIDS and other CUDA libraries make heavy use of Thrust. Thrust uses CUDA device memory in two situations:

  1. As the backing store for thrust::device_vector, and
  2. As temporary storage inside some algorithms, such as thrust::sort.

RMM provides rmm::mr::thrust_allocator as a conforming Thrust allocator that uses device_memory_resources.

Thrust Algorithms

To instruct a Thrust algorithm to use rmm::mr::thrust_allocator to allocate temporary storage, you can use the custom Thrust CUDA device execution policy: rmm::exec_policy(stream).

thrust::sort(rmm::exec_policy(stream), ...);

The stream argument passed to rmm::exec_policy is used both by rmm::mr::thrust_allocator to allocate temporary storage and as the stream on which the Thrust algorithm executes, so allocation and execution always occur on the same stream.
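
For example, a short sketch sorting an rmm::device_uvector with temporary storage allocated through RMM:

rmm::cuda_stream stream;
rmm::device_uvector<int> v(1000, stream.view());
// ... fill `v` on `stream` ...

// Temporary storage for the sort is allocated via rmm::mr::thrust_allocator on `stream`,
// and the algorithm executes on the same stream.
thrust::sort(rmm::exec_policy(stream.view()), v.begin(), v.end());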

Logging

RMM includes two forms of logging: memory event logging and debug logging.

Memory Event Logging and logging_resource_adaptor

Memory event logging writes details of every allocation or deallocation to a CSV (comma-separated value) file. In C++, Memory Event Logging is enabled by using the logging_resource_adaptor as a wrapper around any other device_memory_resource object.

Each row in the log represents either an allocation or a deallocation. The columns of the file are "Thread, Time, Action, Pointer, Size, Stream".

The CSV output files of the logging_resource_adaptor can be used as input to REPLAY_BENCHMARK, which is available when building RMM from source, in the gbenchmarks folder in the build directory. This log replayer can be useful for profiling and debugging allocator issues.

The following C++ example creates a logging version of a cuda_memory_resource that outputs the log to the file "logs/test1.csv".

std::string filename{"logs/test1.csv"};
rmm::mr::cuda_memory_resource upstream;
rmm::mr::logging_resource_adaptor<rmm::mr::cuda_memory_resource> log_mr{&upstream, filename};

If a file name is not specified, the environment variable RMM_LOG_FILE is queried for the file name. If RMM_LOG_FILE is not set, then an exception is thrown by the logging_resource_adaptor constructor.

In Python, memory event logging is enabled when the logging parameter of rmm.reinitialize() is set to True. The log file name can be set using the log_file_name parameter. See help(rmm.reinitialize) for full details.

Debug Logging

RMM includes a debug logger which can be enabled to log trace and debug information to a file. This information can show when errors occur, when additional memory is allocated from upstream resources, etc. The default log file is rmm_log.txt in the current working directory, but the environment variable RMM_DEBUG_LOG_FILE can be set to specify the path and file name.

There is a CMake configuration variable RMM_LOGGING_LEVEL, which can be set to enable compilation of more detailed logging. The default is INFO. Available levels are TRACE, DEBUG, INFO, WARN, ERROR, CRITICAL and OFF.

The log relies on the spdlog library.

Note that to see logging below the INFO level, the application must also set the logging level at run time. C++ applications must call rmm::logger().set_level(); for example, to enable all levels of logging down to TRACE, call rmm::logger().set_level(spdlog::level::trace) (and compile librmm with -DRMM_LOGGING_LEVEL=TRACE). Python applications must call rmm.set_logging_level(); for example, to enable all levels of logging down to TRACE, call rmm.set_logging_level("trace") (and compile the RMM Python module with -DRMM_LOGGING_LEVEL=TRACE).

Note that debug logging is different from the CSV memory allocation logging provided by rmm::mr::logging_resource_adaptor. The latter logs a history of allocation/deallocation actions which can be useful for replay with RMM's replay benchmark.

RMM and CUDA Memory Bounds Checking

Memory allocations taken from a memory resource that allocates a pool of memory (such as pool_memory_resource and arena_memory_resource) are part of the same low-level CUDA memory allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to be detected by CUDA tools such as CUDA Compute Sanitizer memcheck.

Exceptions to this are cuda_memory_resource, which wraps cudaMalloc, and cuda_async_memory_resource, which uses cudaMallocAsync with CUDA's built-in memory pool functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these resources are detectable with Compute Sanitizer Memcheck.

It may be possible in the future to add support for memory bounds checking with other memory resources using NVTX APIs.

Using RMM in Python

There are two ways to use RMM in Python code:

  1. Using the rmm.DeviceBuffer API to explicitly create and manage device memory allocations
  2. Transparently via external libraries such as CuPy and Numba

RMM provides a MemoryResource abstraction to control how device memory is allocated in both the above uses.

DeviceBuffer

A DeviceBuffer represents an untyped, uninitialized device memory allocation. DeviceBuffers can be created by providing the size of the allocation in bytes:

>>> import rmm
>>> buf = rmm.DeviceBuffer(size=100)

The size of the allocation and the memory address associated with it can be accessed via the .size and .ptr attributes respectively:

>>> buf.size
100
>>> buf.ptr
140202544726016

DeviceBuffers can also be created by copying data from host memory:

>>> import rmm
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='float64')
>>> buf = rmm.DeviceBuffer.to_device(a.tobytes())
>>> buf.size
24

Conversely, the data underlying a DeviceBuffer can be copied to the host:

>>> np.frombuffer(buf.tobytes())
array([1., 2., 3.])

MemoryResource objects

MemoryResource objects are used to configure how device memory allocations are made by RMM.

By default if a MemoryResource is not set explicitly, RMM uses the CudaMemoryResource, which uses cudaMalloc for allocating device memory.

rmm.reinitialize() provides an easy way to initialize RMM with specific memory resource options across multiple devices. See help(rmm.reinitialize) for full details.

For lower-level control, the rmm.mr.set_current_device_resource() function can be used to set a different MemoryResource for the current CUDA device. For example, enabling the ManagedMemoryResource tells RMM to use cudaMallocManaged instead of cudaMalloc for allocating memory:

>>> import rmm
>>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

⚠️ The default resource must be set for any device before allocating any device memory on that device. Setting or changing the resource after device allocations have been made can lead to unexpected behaviour or crashes. See Multiple Devices above.

As another example, PoolMemoryResource allows you to allocate a large "pool" of device memory up-front. Subsequent allocations will draw from this pool of already allocated memory. The example below shows how to construct a PoolMemoryResource with an initial size of 1 GiB and a maximum size of 4 GiB. The pool uses CudaMemoryResource as its underlying ("upstream") memory resource:

>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size=2**30,
...     maximum_pool_size=2**32
... )
>>> rmm.mr.set_current_device_resource(pool)

Other MemoryResources include:

  • FixedSizeMemoryResource for allocating fixed blocks of memory
  • BinningMemoryResource for allocating blocks within specified "bin" sizes from different memory resources

MemoryResources are highly configurable and can be composed together in different ways. See help(rmm.mr) for more information.

Using RMM with third-party libraries

Using RMM with CuPy

You can configure CuPy to use RMM for memory allocations by setting the CuPy CUDA allocator to rmm_cupy_allocator:

>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> import cupy
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)

Note: This only configures CuPy to use the current RMM resource for allocations. It does not initialize nor change the current resource, e.g., enabling a memory pool. See here for more information on changing the current memory resource.

Using RMM with Numba

You can configure Numba to use RMM for memory allocations using the Numba EMM Plugin.

This can be done in two ways:

  1. Setting the environment variable NUMBA_CUDA_MEMORY_MANAGER:
$ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args)
  2. Using the set_memory_manager() function provided by Numba:
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)

Note: This only configures Numba to use the current RMM resource for allocations. It does not initialize nor change the current resource, e.g., enabling a memory pool. See here for more information on changing the current memory resource.

Using RMM with PyTorch

PyTorch can use RMM for memory allocation. For example, to configure PyTorch to use an RMM-managed pool:

import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

PyTorch and RMM will now share the same memory pool.

You can, of course, use a custom memory resource with PyTorch as well:

import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

# note that you can configure PyTorch to use RMM either before or
# after changing RMM's memory resource.  PyTorch will use whatever
# memory resource is configured to be the "current" memory resource at
# the time of allocation.
torch.cuda.change_current_allocator(rmm_torch_allocator)

# configure RMM to use a managed memory resource, wrapped with a
# statistics resource adaptor that can report information about the
# amount of memory allocated:
mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)

x = torch.tensor([1, 2]).cuda()

# the memory resource reports information about PyTorch allocations:
mr.allocation_counts
Out[6]:
{'current_bytes': 16,
 'current_count': 1,
 'peak_bytes': 16,
 'peak_count': 1,
 'total_bytes': 16,
 'total_count': 1}

Taking ownership of C++ objects from Python.

When interacting with a C++ library that uses RMM from Python, one must be careful when taking ownership of rmm::device_buffer objects on the Python side. The rmm::device_buffer does not contain an owning reference to the memory resource used for its allocation (only a device_async_resource_ref), and the allocating user is expected to keep this memory resource alive for at least the lifetime of the buffer. When taking ownership of such a buffer in Python, we have no way (in the general case) of ensuring that the memory resource will outlive the buffer we are now holding.

To avoid any issues, we need two things:

  1. The C++ library we are interfacing with should accept a memory resource that is used for allocations that are returned to the user.
  2. When calling into the library from Python, we should provide a memory resource whose lifetime we control. This memory resource should then be provided when we take ownership of any allocated rmm::device_buffers.

For example, suppose we have a C++ function that allocates device_buffers, which has a utility overload that defaults the memory resource to the current device resource:

std::unique_ptr<rmm::device_buffer> allocate(
  std::size_t size,
  rmm::mr::device_async_resource_ref mr = get_current_device_resource())
{
    return std::make_unique<rmm::device_buffer>(size, rmm::cuda_stream_default, mr);
}

The Python DeviceBuffer class has a convenience Cython function, c_from_unique_ptr to construct a DeviceBuffer from a unique_ptr<rmm::device_buffer>, taking ownership of it. To do this safely, we must ensure that the allocation that was done on the C++ side uses a memory resource we control. So:

# Bad, doesn't control lifetime
buffer_bad = DeviceBuffer.c_from_unique_ptr(allocate(10))

# Good, allocation happens with a memory resource we control
# mr is a DeviceMemoryResource
buffer_good = DeviceBuffer.c_from_unique_ptr(
    allocate(10, mr.get_mr()),
    mr=mr,
)

Note two differences between the bad and good cases:

  1. In the good case we pass the memory resource to the allocation function.
  2. In the good case, we pass the same memory resource to the DeviceBuffer constructor so that its lifetime is tied to the lifetime of the buffer.

Potential pitfalls of relying on get_current_device_resource

Functions in both the C++ and Python APIs that perform allocation typically default the memory resource argument to the value of get_current_device_resource. This is to simplify the interface for callers. When using a C++ library from Python, this defaulting is safe, as long as it is only the Python process that ever calls set_current_device_resource.

This is because the current device resource on the C++ side has a lifetime which is expected to be managed by the user. The resources set by rmm::mr::set_current_device_resource are stored in a static std::map whose keys are device ids and values are raw pointers to the memory resources. Consequently, rmm::mr::get_current_device_resource returns an object with no lifetime provenance. This is, for the reasons discussed above, not usable from Python. To handle this on the Python side, the Python-level set_current_device_resource sets the C++ resource and stores the Python object in a static global dictionary. The Python get_current_device_resource then does not use rmm::mr::get_current_device_resource and instead looks up the current device resource in this global dictionary.

Hence, if the C++ library we are interfacing with calls rmm::mr::set_current_device_resource, the C++ and Python sides of the program can disagree on what get_current_device_resource returns. The only safe thing to do if using the simplified interfaces is therefore to ensure that set_current_device_resource is only ever called on the Python side.

rmm's Issues

[BUG] Hang on 1TB allocation test with Managed Pool mode on DGX-1

Describe the bug
When RMM options are set to use pool allocations and use CUDA Managed Memory, the AllocateTB test hangs or runs for a very long time. I believe the cause is that cudaMallocManaged succeeds for a 1TB allocation when there is sufficient virtual system memory, but the subsequent cudaMemPrefetchAsync() runs for a long time.

Steps/Code to reproduce bug
Just run RMM_TEST on a DGX-1.

Expected behavior
It should return quickly, and the test should pass (potentially by correctly detecting an allocation failure, or by not prefetching if the allocation is larger than the gpu memory size).

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of RMM install: from source

[FEA] smart pointers (unique_ptr and shared_ptr) with custom deleters and device_buffer.

Is your feature request related to a problem? Please describe.
std::unique_ptr and std::shared_ptr support safer programming, but to use them with RMM, I need to define custom deleters that invoke RMM_FREE instead of C++'s default delete. Currently, every project using RMM must define its own, which duplicates work.

Also, cudf currently has device_buffer, which provides a wrapper for an RMM memory block (similar to thrust::device_vector with the RMM allocator, but without the initialization overhead). Other projects could benefit from this as well, and I hope RMM provides this feature rather than every project reimplementing its own.

[BUG] Segmentation Fault inside `thrust::sort` when using pool allocation

Describe the bug
A segmentation fault occurs inside of the thrust::sort call inside of gdf_order_by of libcudf when RMM pool allocation is used.

Steps/Code to reproduce bug

from librmm_cffi import librmm_config as rmm_cfg
rmm_cfg.use_pool_allocator = True
import cudf
cudf._gdf.rmm_initialize()

df = cudf.DataFrame()
df['a'] = [1,2,3,4,5]
df['b'] = [5,4,3,2,1]
print(df.sort_values(['a']))

Environment details (please complete the following information):
Using branch-0.5 of cuDF 5aa1429f8305cfeb120aaa904d71dabfe785898d

Additional context
As you can see from the stack trace below, the error is occurring inside of a thrust::sort call that is attempting to use RMM to allocate a temporary buffer and using a non-null stream.

#1  0x00007fffe06b05a0 in cuEGLApiInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fffe05c8555 in cuMemGetAttribute_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007fffe070f83f in cuStreamGetFlags () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007fffdf6e9ebf in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#5  0x00007fffdf71231f in cudaStreamGetFlags () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#6  0x00007fffdf964df9 in cnmem::Manager::setStream (this=0x21f65cf0, stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/thirdparty/cnmem/src/cnmem.cpp:392
#7  0x00007fffdf9643fe in cnmemRegisterStream (stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/thirdparty/cnmem/src/cnmem.cpp:1166
#8  0x00007fffdf95ef8e in rmm::Manager::registerStream (this=0x7fffdfb6e160 <rmm::Manager::getInstance()::instance>, stream=0x7ffb50000600) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/src/memory_manager.cpp:94
#9  0x00007fffc84791b1 in rmm::alloc<void> (ptr=0x7fffffffc2b0, size=767, stream=0x7ffb50000600, file=0x7fffc8b7c0e0 <_ZN3rmmL17RMM_USAGE_LOGGINGE+3889> "/home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/thrust_rmm_allocator.h", line=48)
    at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/rmm.hpp:133
#10 0x00007fffc84e6038 in rmm_allocator<char>::allocate (this=0x7fffffffcae0, n=767) at /home/jhemstad/RAPIDS/repro/cudf/cpp/thirdparty/rmm/include/rmm/thrust_rmm_allocator.h:48
#11 0x00007fffc84e5bdf in thrust::detail::allocator_traits<rmm_allocator<char> >::allocate(rmm_allocator<char>&, unsigned long)::workaround_warnings::allocate(rmm_allocator<char>&, unsigned long) (a=..., n=767)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.inl:230
#12 0x00007fffc84e5c05 in thrust::detail::allocator_traits<rmm_allocator<char> >::allocate (a=..., n=767) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/allocator/allocator_traits.inl:234
#13 0x00007fffc84e4a59 in thrust::detail::get_temporary_buffer<char, rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> (system=..., n=767) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/execute_with_allocator.h:86
#14 0x00007fffc84e2f76 in thrust::get_temporary_buffer<char, thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> > (exec=..., n=767)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/temporary_buffer.h:62
#15 0x00007fffc84e24b3 in thrust::cuda_cub::get_memory_buffer<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> > (policy=..., n=767)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/memory_buffer.h:57
#16 0x00007fffc84e1096 in thrust::cuda_cub::__merge_sort::merge_sort<thrust::detail::integral_constant<bool, false>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >, int*, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (compare_op=..., items_first=0x0, keys_last=0x7ffb50000614, keys_first=0x7ffb50000600, policy=...)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1336
#17 thrust::cuda_cub::__smart_sort::smart_sort<thrust::detail::integral_constant<bool, false>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >, int*, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (compare_op=..., items_first=0x0, keys_last=0x7ffb50000614, keys_first=0x7ffb50000600, policy=...)
    at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1576
#18 thrust::cuda_cub::sort<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (policy=..., first=0x7ffb50000600, 
    last=0x7ffb50000614, compare_op=...) at /usr/local/cuda/targets/x86_64-linux/include/thrust/system/cuda/detail/sort.h:1653
#19 0x00007fffc84de58b in thrust::sort<thrust::detail::execute_with_allocator<rmm_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, int*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*), &(void multi_col_sort<int>(void* const*, unsigned char* const*, int*, signed char*, unsigned long, unsigned long, bool, int*, bool, CUstream_st*)), 2u>, LesserRTTI<int> > > (exec=..., 
    first=0x7ffb50000600, last=0x7ffb50000614, comp=...) at /usr/local/cuda/targets/x86_64-linux/include/thrust/detail/sort.inl:56
#20 0x00007fffc84ddb1e in multi_col_sort<int> (d_cols=0x7ffb50000a00, d_valids=0x7ffb50000c00, d_col_types=0x7ffb50000e00, d_asc_desc=0x7ffb50000800 "", ncols=1, nrows=5, have_nulls=false, d_indx=0x7ffb50000600, nulls_are_smallest=false, stream=0x0)
    at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/../sqls/sqls_rtti_comp.h:814
#21 0x00007fffc84daa21 in (anonymous namespace)::multi_col_order_by (cols=0x219e89e0, asc_desc=0x7ffb50000800 "", ncols=1, output_indices=0x21998850, flag_nulls_are_smallest=false) at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/orderby.cu:57
#22 0x00007fffc84daae9 in gdf_order_by (cols=0x219e89e0, asc_desc=0x7ffb50000800 "", ncols=1, output_indices=0x21998850, flag_nulls_are_smallest=0) at /home/jhemstad/RAPIDS/repro/cudf/cpp/src/orderby/orderby.cu:88

[BUG] RMM_FREE of an invalid address is returning RMM_SUCCESS

If I try to rmm free an invalid address (note c_stream is 0):
err = RMM_FREE(reinterpret_cast<void*>(100), c_stream);

It prints this warning:
warning: Cuda API error detected: cudaFree returned (0x11)

but err is RMM_SUCCESS. I expected: RMM_ERROR_CUDA_ERROR.

This is a recent issue in branch-0.10, possibly related to #127.

Here is some repro code; the body of the if statement is never entered.

  cudaStream_t c_stream = reinterpret_cast<cudaStream_t>(0);
  rmmError_t err = RMM_FREE(reinterpret_cast<void*>(100), c_stream);
  if (err != RMM_SUCCESS) {
    std::cout <<"not successful free of invalid address" << err<<std::endl;
  }

[FEA] Function to get info on what's the largest chunk that can be safely allocated

Is your feature request related to a problem? Please describe.
rmmGetInfo gives information about the amount of free memory available. However, that information can be misleading when the memory regions are fragmented.

Describe the solution you'd like
rmmGetInfo should also output another variable that reports the largest contiguous memory region available for allocation.

Describe alternatives you've considered
There are no alternatives to this. The way we have worked around this issue is to expose a 'max-mem' parameter to our users and hope that they pass an amount that will not cause an OOM error down the line. This code can be seen here.

Additional context
Since RMM wraps around cnmem, maybe this change should be done in cnmem itself. But I've filed this issue inside RMM, at least to get the conversation started. Tagging cuML folks, JFYI: @JohnZed @dantegd @cjnolet

[FEA] Provide operator for bit-wise or of allocation modes

Problem: Bit-wise or-ing yields int not enum.

The API documentation implies that the allocation mode enums can be bitwise OR-ed.
Example:

rmmOptions_t rmm_option {
  .allocation_mode = PoolAllocation | CudaManagedMemory,
  .initial_pool_size = free_memory / 2,
  .enable_logging = true };

gives a compiler error: error: a value of type "int" cannot be used to initialize an entity of type "rmmAllocationMode_t"

Suggestions:
a) Implement operator:

inline rmmAllocationMode_t operator|(rmmAllocationMode_t left, rmmAllocationMode_t right) {
   return static_cast<rmmAllocationMode_t>(
     static_cast<int>(left) | static_cast<int>(right));
}

or

b) add member PoolAllocationCudaManagedMemory = 3 to struct without bitwise or-ing.

[BUG] Allocation beyond initial pool size does not reuse freed memory

Describe the bug
After the initial pool size of X has been consumed (and freed), new memory allocations (and frees) cause additional memory to be allocated from the GPU in increments of X.
Memory allocations/frees below the initial pool size X work fine until a new allocation goes above the initial size. This causes RMM to allocate a new chunk of memory on top of the initial pool size to accommodate the request. The caller frees all memory and requests new memory, which again goes over the initial pool size. This causes RMM to allocate yet another chunk of memory. The first extra chunk is not reused although it has been entirely freed. There are now 3X of GPU memory allocated even though less than 2X has been requested. Continuing this pattern allocates additional chunks of X memory until the GPU resources are used up.

Steps/Code to reproduce bug
Created simple test to show this problem here:
https://github.com/davidwendt/rmmtest/blob/master/explode.cu
The program allocates increasingly large buffers two at a time (each pair followed by two frees) and requests no more than 4GB total at any one time. Again, all memory is freed almost immediately after allocating.
With an initial pool size set to 4GB, this works well: RMM allocates 4GB and never goes above it.
With an initial pool size set to 2GB, RMM ends up allocating 24GB of GPU memory for the same code.
The intermediate new chunks of memory do not seem to be reused.

Expected behavior
Requesting memory beyond the initial pool size should be able to reuse freed memory in the new chunks.

Environment details (please complete the following information):

  • Environment location: created on Ubuntu 16.04 desktop
  • Method of RMM install: built from source -- repo for example above has cmake

rmmenv.txt

[FEA] Create a process so RMM can live as a global memory manager for multiple processes

Is your feature request related to a problem? Please describe.
As more and more people start using the ecosystem and building workloads using rapids.ai, they will start spawning processes that are triggered by real-time events, by a clock, by user interaction, etc. We don't have a way of estimating usage of all of our algorithms (e.g. group by and join), but we DO know each time that cudf requests an allocation from rmm. Because the execution of these different workloads using rapids.ai is unpredictable in terms of both scheduling and memory consumption, we can run into situations where we run out of resources not because any of the particular jobs requires more memory than can be provided, but because the jobs can't be run at the same time.

Describe the solution you'd like

  1. To make a distinction between allocations which are in a temporary state (e.g. being used for calculations in a short-term process, something that lives in seconds not minutes) and those which are long-lived (e.g. stuff we hand back to the user, things that we decide to store for longer than just while being processed).
  2. To understand whether a job can begin and proceed based on whether there is enough memory available right now, or will be soon, because freeing the allocations in a temporary state would grant enough memory to make the allocation.
  3. To have all allocations originate in a single process that keeps track of how much has been allocated and in what state, so that we can fulfill requests even when enough memory is not available at the moment the initial request is made, so long as it can be fulfilled soon after memory becomes available.
  4. To possibly allow allocations to be non-blocking and belong to a group of allocations which we can synchronize together as a group. This would handle cases where an algorithm is going to make 4 allocations at one point and those allocations all need to exist at the same time. They can succeed or fail as a group and we can back out any allocations that were made if any in the group did not succeed.

Describe alternatives you've considered
Tracking allocations within our uses of cudf and adding a wrapper to the cudf python library that keeps track of memory as it comes in and out but I don't think this would really work.

Additional context
The code we currently have works great for demos and workloads that you run one time. As people develop their toolsets they will run more and more workloads, and it will not be possible to assume that these workloads are being queued to be run, nor should they be. I really think we should start considering possibilities for managing allocations across multiple processes. This could also allow us to be more aggressive with the size of the pool. Last piece of context: I have not thought this through at length and this is just a stream-of-consciousness set of ideas to help get a discussion going.

[BUG] gc.collect() needs to be called before rmm.finalize() in test_rmm.py

Describe the bug

test_rmm.py serves as a reference to test code with different RMM configurations.

# Test all combinations of default/managed and pooled/non-pooled allocation
@pytest.mark.parametrize('managed, pool',
                         list(product([False, True], [False, True])))
def test_rmm_modes(managed, pool):
    rmm.finalize()
    rmm_cfg.use_managed_memory = managed
    rmm_cfg.use_pool_allocator = pool
    rmm.initialize()

    assert(rmm.is_initialized())

    array_tester(np.int32, 128)

array_tester creates objects holding GPU memory. Calling rmm.finalize() before these objects are destroyed can lead to memory corruption; this can lead to undefined behaviors. Calling gc.collect() (before rmm.finalize()) triggers objects with 0 reference count to be deleted (and release GPU memory) to avoid memory corruption.

[BUG] Error when using RMM: parallel_for failed: out of memory

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of RMM install: Docker
    • Docker pull: docker pull rapidsai/rapidsai-dev:0.9-cuda10.0-devel-ubuntu16.04-py3.7
    • Docker run: docker run --runtime=nvidia --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-extended/:/rapids/notebooks/extended/ -v /home/rapids/data/:/home/rapids/data/ rapidsai/rapidsai-dev:0.9-cuda10.0-devel-ubuntu16.04-py3.7

Describe the bug
I am using the Jupyter notebook NYCTaxi-E2E.ipynb and have added the RMM functionality; however, the system crashes at the XGBoost training step. See the error below:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory

Steps/Code to reproduce bug
Added methods:

import cudf
from librmm_cffi import librmm_config as rmm_cfg  # imports assumed from the notebook context

def initialize_rmm_pool():
    rmm_cfg.use_pool_allocator = True
    return cudf.rmm.initialize()

def initialize_rmm_no_pool():
    rmm_cfg.use_pool_allocator = False
    return cudf.rmm.initialize()

def finalize_rmm():
    return cudf.rmm.finalize()

[FEA] Provide an `uninitialized_vector` that doesn't initialize the allocation

Is your feature request related to a problem? Please describe.
rmm::device_vector is an alias for a thrust::device_vector that uses RMM as the allocator. By default, thrust::device_vector will invoke the default constructor for each element in the vector. This is oftentimes unnecessary overhead as it requires invoking a kernel to initialize the elements of the vector.

Describe the solution you'd like
Provide rmm::uninitialized_device_vector that simply allocates the memory of the specified size and sets the .size() appropriately.

See https://github.com/thrust/thrust/blob/master/examples/uninitialized_vector.cu for reference.
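As a rough sketch of that approach (illustrative only, not RMM's implementation): an allocator whose construct() is a no-op lets thrust::device_vector skip element initialization.

#include <thrust/device_malloc_allocator.h>
#include <thrust/device_vector.h>

// Allocator that allocates device memory normally but never default-constructs elements.
template <typename T>
struct uninitialized_allocator : thrust::device_malloc_allocator<T> {
  // no-op construct: leave the allocated memory uninitialized
  __host__ __device__ void construct(T*) {}
};

// A device_vector whose elements are left uninitialized.
template <typename T>
using uninitialized_device_vector = thrust::device_vector<T, uninitialized_allocator<T>>;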

[FEA] Support pooled memory manager for multiple devices

Is your feature request related to a problem? Please describe.
Currently, the memory manager is a singleton class which means all devices share the same pool.

Describe the solution you'd like
Ideally, we could create a memory manager per device, or pass a device parameter in the RMMMalloc/RMMFree calls.

[DOC] RMM headers don't specify alignment of allocations

(I'm using the RMM in rapidsai/cudf/branch-0.5.)

It seems the Doxygen comments, such as they are, for alloc(), RMM_ALLOC() and other relevant functions do not indicate whether allocations are aligned and, if so, to what alignment.

[BUG] Wrong order of LogIt class private variables

Describe the bug
If rmm is used by a library built with -Werror, then compilation fails with the following message:

/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp: In constructor ‘rmm::LogIt::LogIt(rmm::Logger::MemEvent_t, void*, size_t, cudaStream_t, const char*, unsigned int, bool)’:
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:101:8: error: ‘rmm::LogIt::usageLogging’ will be initialized after [-Werror=reorder]
   bool usageLogging;
        ^~~~~~~~~~~~
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:100:16: error:   ‘unsigned int rmm::LogIt::line’ [-Werror=reorder]
   unsigned int line;
                ^~~~
/home/aatish/workspace/cuhornet/hornet/../externals/rmm/include/rmm/rmm.hpp:59:3: error:   when initialized here [-Werror=reorder]
   LogIt(Logger::MemEvent_t event, void* ptr, size_t size, cudaStream_t stream,
   ^~~~~
cc1plus: all warnings being treated as errors

This can be replicated with branch-0.10

[BUG] RMM logging is slow.

Describe the bug
RMM logging is slow. It was written quickly to get something working, but the overhead of using the STL for logging is too high, so it is off by default.

Steps/Code to reproduce bug
Turn on logging in a big app with a lot of alloc/free (e.g. RAPIDS E2E workflow) and see how much it slows down.

Expected behavior
Fast.

[DOC] Update README to include conda install

Report incorrect documentation

Hi!

The README.md file mentions that RMM can only be installed from source.

Nevertheless, I have found the following conda packages:

https://anaconda.org/rapidsai/rmm

https://anaconda.org/rapidsai/librmm

I am wondering if the README.md file is up to date. If not, it would be great to update it to mention the conda installs.

Location of incorrect documentation
README.md in master branch.

https://github.com/rapidsai/rmm#install-rmm

Describe the problems or issues found in the documentation
(detailed above)

[BUG] rmm_allocator::deallocate(): RMM_FREE: initialization error

For Python classes that wrap C++ classes containing memory allocated by RMM, the Python order of de-allocation at process exit may cause an RMM_FREE initialization error when using the pool allocator. This occurs because the RMM instance may have been destroyed before the Python object. This error causes the Python process to terminate with a core dump instead of exiting cleanly.

Simple test case to show the error from a Python command-line interpreter:

>>> from librmm_cffi import librmm as rmm
>>> from librmm_cffi import librmm_config as rmm_cfg
>>> rmm_cfg.use_pool_allocator = True 
>>> rmm.initialize()
0
>>> import nvstrings
>>> strs = nvstrings.to_device(["hello"])
>>> exit()

Before the process ends cleanly the following exception occurs terminating the process:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::deallocate(): RMM_FREE: cudaErrorInitializationError: initialization error
Aborted

This particular error is thrown in `rmm_allocator::deallocate()`:

    inline void deallocate(pointer ptr, size_t)
    {
      rmmError_t error = RMM_FREE(thrust::raw_pointer_cast(ptr), stream);
  
      if(error != RMM_SUCCESS)
      {
        throw thrust::system_error(error, thrust::cuda_category(), "rmm_allocator::deallocate(): RMM_FREE");
      }
    }

The nvstrings instance points to a C++ NVStrings instance which has a member variable allocated with rmm::device_vector, and this vector is freed after RMM has been deinitialized by the Python process.

Throwing the error in the rmm::device_vector destructor causes the process to terminate (core dump).

I propose checking for this condition (free called after deinitialization) inside RMM_FREE or rmm::free and ignoring the error, since the memory has already been freed and no corruption will occur.
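
A sketch of that proposal, mirroring the snippet above (the RMM_ERROR_NOT_INITIALIZED check is an assumption about how "free after deinit" could be detected, not existing RMM behavior):

    inline void deallocate(pointer ptr, size_t)
    {
      rmmError_t error = RMM_FREE(thrust::raw_pointer_cast(ptr), stream);

      // Assumed detection of "free after deinit": the pool is already gone,
      // so ignore the failure instead of throwing from a destructor path.
      if (error != RMM_SUCCESS && error != RMM_ERROR_NOT_INITIALIZED)
      {
        throw thrust::system_error(error, thrust::cuda_category(),
                                   "rmm_allocator::deallocate(): RMM_FREE");
      }
    }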

[FEA] Add multi-device and multi-threaded test(s)

Is your feature request related to a problem? Please describe.
RMM supports multi-device allocation and is thread safe, but we don't have tests for either of these.

Describe the solution you'd like
Add tests for allocation on multiple devices. Add multi-threaded single-device and multi-device tests.

[BUG] RMM pytests failing flake8 style checks

Describe the bug
Now that CI is being added, flake8 is finding minor style problems, for example in rmm_tests.py

Steps/Code to reproduce bug
Run flake8 python from the root RMM directory.

Expected behavior
No errors

[BUG] MemoryTest assumes exclusive use of GPU

Describe the bug

The unit test MemoryTest.GetInfo tests that the memory available on the GPU goes down after a successful allocation.

https://github.com/rapidsai/rmm/blob/branch-0.7/tests/memory_tests.cpp#L190

It uses the rmmGetInfo API, which in non-pool mode calls cudaMemGetInfo, which queries the entire device's memory usage. This isn't resilient to other processes using the GPU: another process may free a large portion of memory between the two measurements, increasing the reported free memory and causing this test to fail:

04:29:21 [ RUN      ] MemoryManagerTest/2.GetInfo
04:29:21 /rapids/cudf/cpp/thirdparty/rmm/tests/memory_tests.cpp:207: Failure
04:29:21 Expected: (freeAfter) <= (freeBefore), actual: 20142030848 vs 20114767872
04:29:21 [  FAILED  ] MemoryManagerTest/2.GetInfo, where TypeParam = ModeType<(rmmAllocationMode_t)2> (3 ms)

I believe this test could be made more resilient to GPU sharing by using the NVML API nvmlDeviceGetComputeRunningProcesses. This allows you to query the GPU memory usage of each process using the GPU. In this way, the test can be refactored to ensure that the memory used by the calling process grows as a result of the allocation.
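
A sketch of such a per-process query (assumes NVML is linked; error handling omitted; the helper name is illustrative):

#include <nvml.h>
#include <unistd.h>

// GPU memory (in bytes) used by the calling process on device 0,
// or 0 if the process does not appear in the compute-process list.
unsigned long long current_process_gpu_memory()
{
  nvmlInit();
  nvmlDevice_t device;
  nvmlDeviceGetHandleByIndex(0, &device);

  unsigned int count = 128;
  nvmlProcessInfo_t procs[128];
  nvmlDeviceGetComputeRunningProcesses(device, &count, procs);

  unsigned long long used = 0;
  unsigned int const pid = static_cast<unsigned int>(getpid());
  for (unsigned int i = 0; i < count; ++i)
    if (procs[i].pid == pid) used = procs[i].usedGpuMemory;

  nvmlShutdown();
  return used;
}

The test could then assert that this value grows after a successful allocation, instead of comparing device-wide free memory before and after.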

Expected behavior
Unit tests should be resilient to multiple processes using the GPU.

[FEA] Host Doxygen HTML

Is your feature request related to a problem? Please describe.
The RMM Doxygen documentation HTML page should be accessible without requiring someone to clone the repo and build with make doc.

Describe the solution you'd like
Doxygen HTML should be hosted and accessible via GitHub or a dev docs page.

[DOC] Document logging

Report incorrect documentation

Location of incorrect documentation
README.md has no explanation of logging and how to use it from C++ or Python

Describe the problems or issues found in the documentation
README.md has no explanation of logging and how to use it from C++ or Python

Suggested fix for documentation
Add explanation and usage examples of logging to README.md

[BUG] Including rmm/rmm.h without cmake, make, make install fails

Describe the bug
If a project includes rmm/rmm.h without running cmake, make, and make install, compilation fails with

rmm/include/rmm/detail/memory_manager.hpp:37:30: fatal error: rmm/detail/cnmem.h: No such file or directory

This does not happen if the include of cnmem.h in memory_manager.hpp:37 is done via #include "cnmem.h" instead of #include "rmm/detail/cnmem.h". For projects that have a header-only dependency on rmm, the cmake/make/make install step is not necessary, so it would be desirable if this worked.

Steps/Code to reproduce bug
Compiling

#include <rmm/rmm.h>

int main()
{
        return 0;
}

with

g++ -I$CUDA_HOME/include -Irmm/include rmm_include_bug.cpp

reproduces the error

In file included from rmm/include/rmm/rmm.hpp:28:0,
                 from rmm/include/rmm/rmm.h:5,
                 from rmm_include_bug.cpp:1:
rmm/include/rmm/detail/memory_manager.hpp:37:10: fatal error: rmm/detail/cnmem.h: No such file or directory
 #include "rmm/detail/cnmem.h"
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.

Expected behavior
Compilation of the above example works.

Environment details:

  • Environment location: Bare-metal
  • Method of RMM install: from source
  • Output of print_env.sh attached as rmm_print_env.log

[FEA] Replace CFFI bindings with Cython

We should transition the current Python bindings API to Cython in order:

  • to support promoting C++ exceptions up to Python exceptions for better error reporting
  • to match the approach cuDF is taking

[FEA] RMM_TRY and RMM_TRY_CUDAERROR

Is your feature request related to a problem? Please describe.
cudf has RMM_TRY and RMM_TRY_CUDAERROR. Other projects using RMM often need to redefine RMM_TRY and RMM_TRY_CUDAERROR. It would be better if RMM provided these macros.

We may also need a no-throw version (e.g. something like RMM_TRY_NOTHROW), mainly for class destructors and for RMM_FREE; class destructors are noexcept by default, and if calling RMM_FREE with an erroneous parameter results in undefined behavior similar to std::free (https://en.cppreference.com/w/cpp/memory/c/free), it is better to crash the program after printing an error than to continue executing with undefined behavior.
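
A rough sketch of what the macros could look like (names and behavior modeled on the description above; not an existing RMM API):

#include <rmm/rmm.h>  // rmmError_t, RMM_SUCCESS

#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Throwing version: convert an RMM error code into a C++ exception.
#define RMM_TRY(call)                                                     \
  do {                                                                    \
    rmmError_t const status = (call);                                     \
    if (status != RMM_SUCCESS)                                            \
      throw std::runtime_error("RMM failure " +                           \
                               std::to_string(static_cast<int>(status))); \
  } while (0)

// No-throw version for destructors: print and abort rather than throw.
#define RMM_TRY_NOTHROW(call)                                             \
  do {                                                                    \
    rmmError_t const status = (call);                                     \
    if (status != RMM_SUCCESS) {                                          \
      std::fprintf(stderr, "RMM failure %d at %s:%d\n",                   \
                   static_cast<int>(status), __FILE__, __LINE__);         \
      std::abort();                                                       \
    }                                                                     \
  } while (0)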

[FEA] Provide a `device_vector`-like abstraction that can accept streams

Is your feature request related to a problem? Please describe.

rmm::device_vector is currently a simple alias for a thrust::device_vector with rmm_allocator<T> used as its allocator template argument. This allocator always uses the null stream for memory allocation, and there is no way for users to modify this behavior.

As seen in rapidsai/cudf#2631, this is problematic.

Describe the solution you'd like

RMM should provide an improved device_vector abstraction. It cannot simply be a type alias, as it requires specifying constructor arguments that thrust::device_vector does not currently support (*). However, we can avoid fully reinventing the wheel by inheriting from thrust::device_vector and adding the necessary new constructors.

It should be built to also accept a device_memory_resource to support the new memory resource design.

(*) Thrust in CUDA 10.1 added support for passing allocators as function arguments; however, that does not fully solve this issue. First, we cannot assume all users of RMM can use CUDA 10.1. Second, it still does not allow simply specifying a stream as a constructor argument.


[FEA] Configure and start building Doxygen HTML documentation

Is your feature request related to a problem? Please describe.
RMM should build the HTML Doxygen documentation from its in-line comments.

Describe the solution you'd like
Add a Doxyfile with configuration options necessary to build the RMM Doxygen HTML documentation.

Ideally, the HTML documentation should then be made available on the web without requiring individuals to build it themselves.

Related: rapidsai/cudf#698

RMM Memory Leak after running for a while [QST]

What is your question?
AresDB integrated RMM last week, and we have been running it in staging for a while.
We use pooled memory management and the default stream for memory allocation.

After about 30 minutes, it appears that all memory on one GPU card is exhausted, and a segmentation fault happens on the next memory allocation.

I don't think there are any memory leaks in our code, since this code worked previously when we called cudaMalloc/cudaFree directly.

Here is the link to our code
https://github.com/uber/aresdb/blob/master/memutils/memory/rmm_alloc.cu
Thank you so much!

[FEA] Multi-GPU support (single node)

Related issue: #66

Is your feature request related to a problem? Please describe.

I wish I could use RMM for a multi-GPU node. However, it may not be possible in the current implementation if I enable pool allocation.

 54 // Initialize memory manager state and storage.
 55 rmmError_t rmmInitialize(rmmOptions_t *options)
 56 {
 57     rmm::Manager::getInstance().initialize(options);
 58 
 59     if (rmm::Manager::usePoolAllocator())
 60     {
 61         cnmemDevice_t dev;
 62         RMM_CHECK_CUDA( cudaGetDevice(&(dev.device)) );
 63         // Note: cnmem defaults to half GPU memory
 64         dev.size = rmm::Manager::getOptions().initial_pool_size;
 65         dev.numStreams = 1;
 66         cudaStream_t streams[1]; streams[0] = 0;
 67         dev.streams = streams;
 68         dev.streamSizes = 0;
 69         unsigned flags = rmm::Manager::useManagedMemory() ? CNMEM_FLAGS_MANAGED : 0;
 70         RMM_CHECK_CNMEM( cnmemInit(1, &dev, flags) );
 71     }
 72     return RMM_SUCCESS;
 73 }

rmmInitialize() calls cnmemInit in line 70 with numDevices set to 1.

1071 cnmemStatus_t cnmemInit(int numDevices, const cnmemDevice_t *devices, unsigned flags) {
1072     // Make sure we have at least one device declared.
1073     CNMEM_CHECK_TRUE(numDevices > 0, CNMEM_STATUS_INVALID_ARGUMENT);
1074 
1075     // Find the largest ID of the device.
1076     int maxDevice = 0;
1077     for( int i = 0 ; i < numDevices ; ++i ) {
1078         if( devices[i].device > maxDevice ) {
1079             maxDevice = devices[i].device;
1080         }
1081     }
1082 
1083     // Create the global context.
1084     cnmem::Context::create();
             ...

cnmemInit() calls cnmem::Context::create() in line 1084 and

1024 cnmemStatus_t Context::create() {
1025     sCtx = new Context;
1026     sCtxCheck = CTX_VALID;
1027     return CNMEM_STATUS_SUCCESS;
1028 }

create() resets the Context class's static member variable sCtx to a newly created Context object.

So, if I call rmmInitialize() multiple times (once per device, after cudaSetDevice()), only the last call will take effect (and the previously allocated Context objects are leaked).

rmmInitialize() does not take a numDevices argument like cnmemInit does, so I cannot initialize RMM for multiple devices in a single rmmInitialize() call, either.

Describe the solution you'd like
We need a mechanism to initialize RMM for multiple devices, either cnmem-style (passing all devices at once) or by calling rmmInitialize multiple times after cudaSetDevice.
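
For illustration, the kind of per-device initialization this would enable (a sketch of desired usage; as analyzed above, with the current cnmem-backed implementation only the last call takes effect):

#include <cuda_runtime_api.h>
#include <rmm/rmm.h>

// Hypothetical: initialize an RMM pool on every visible device.
void initialize_rmm_on_all_devices()
{
  int device_count = 0;
  cudaGetDeviceCount(&device_count);

  for (int dev = 0; dev < device_count; ++dev) {
    cudaSetDevice(dev);
    rmmOptions_t options{rmmAllocationMode_t::PoolAllocation, 0, false};
    rmmInitialize(&options);  // today this resets cnmem's global context each time
  }
}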

[FEA] Add a pinned memory resource

Is your feature request related to a problem? Please describe.

There's currently no memory resource for allocating pinned memory (e.g., cudaHostAlloc).

Describe the solution you'd like

There should be a pinned_memory_resource.
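
A minimal sketch of the idea (illustrative only; it does not implement RMM's resource interface): hand out page-locked host memory with cudaMallocHost/cudaFreeHost.

#include <cuda_runtime_api.h>
#include <cstddef>
#include <new>

class pinned_memory_resource_sketch {
 public:
  // Allocate `bytes` of pinned (page-locked) host memory.
  void* allocate(std::size_t bytes)
  {
    void* p = nullptr;
    if (cudaMallocHost(&p, bytes) != cudaSuccess) throw std::bad_alloc();
    return p;
  }

  // Free memory previously obtained from allocate().
  void deallocate(void* p, std::size_t) noexcept { cudaFreeHost(p); }
};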

Additional context
Inspired by rapidsai/cudf#2872 (comment)

[BUG] "__global__ function call is not configured" <- What does this mean?

I'm developing a feature on top of cudf branch-0.6, using rmm changeset dfe2c4b. At some point, I'm getting this error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  rmm_allocator::deallocate(): RMM_FREE: __global__ function call is not configured
Aborted

This is a problematic exception. Regardless of the reason this happened to me, "system error" is very general, and a typical user of rmm cannot understand what the what() message means.

So, please rewrite the code producing this what() message to:

  • Target people who don't know anything about rmm internals.
  • Be more specific in what exactly happened that should not have.
  • Describe something which should have been done differently, or was skipped etc. - to give the user a hint regarding how to avoid this exception.
  • Add some specifics in addition to the canned string (e.g. address, size, other parameters, other state, or some of the above).

[FEA] Add release() function to pool resources

Is your feature request related to a problem? Please describe.
Calling rmmFinalize should deallocate the memory pool in any of the pool resources. Currently the only way to free a pool is when the pool resource is destroyed at the end of the application.

Describe the solution you'd like
The pool resources need release methods added to free their memory pools. For example, see std::pmr::synchronized_pool_resource::release().

Additional context
When rmmFinalize is invoked, how do we know what resources to call release on? release is not a member of the device_memory_resource base class, so it's not possible to call get_default_resource()->release(). Do we just always call pool_resource()->release() and managed_pool_resource()->release()? But that will end up constructing those resources only to then release them.

[BUG] Remove deprecated #define _BSD_SOURCE from random_allocate benchmark

Describe the bug
random_allocate.cpp includes the line #define _BSD_SOURCE which is deprecated in newer versions of GCC and causes -Werror compilation to fail.

Steps/Code to reproduce bug
Fails to compile on Linux Ubuntu 18.04 L4T kernel

g++ (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Additional context
Trying to build on Jetson Xavier

Note that the fix is easy: remove that line as it seems unnecessary.

[QST] Hard limit on the amount of data rmm can allocate at once?

What is your question?
In the code snippet below, the last line d_pos = pos transfers data from main memory to GPU memory. I'm having issues with this line when the vectors hold data over a certain size threshold (around 30 GB). Is there a hard limit on the amount of data RMM can allocate/move at once?

void initDataset (std::vector<float> *pos, size_t x, size_t y, size_t z)
{
    int i, j, k;
    double Pe;
    std::mt19937 rng(time(NULL));
    std::uniform_real_distribution<float> gen(-4.0, 0.0);
    for (i = -(int)x/2; i < ((int)x/2); ++i)
    {
        for (j = -(int)y/2; j < ((int)y/2); ++j)
        {
            for (k = 0; k < z; ++k)
            {
                Pe = gen(rng);
                pos->push_back(i);
                pos->push_back(j);
                pos->push_back(k);
                pos->push_back(Pe);
            }
        }
    }
}

int main (int argc, char *argv[])
{
    unsigned int i, iter = 30;
    size_t sx = 400, sy = 400, sz = 2000;
    size_t numParticles = 0;
    std::vector<float> pos; // particle positions

    rmm::device_vector<float> d_pos;    // particle positions in GPU
    rmm::device_vector<float> d_posOut; // particle positions out in GPU

    // This will be used to generate the planes' normals randomly
    // between -1 and 1
    std::mt19937 rng(time(NULL));
    std::uniform_real_distribution<float> gen(-1.0, 1.0);
    numParticles = sx*sy*sz;

    // Types of allocations:
    // CudaDefaultAllocation
    // PoolAllocation
    // CudaManagedMemory

    rmmOptions_t options{rmmAllocationMode_t::PoolAllocation, 0, true};
    rmmInitialize(&options);

    initDataset(&pos, sx, sy, sz);

    // plane defined by normal and D
    float normal[3], d = 0.0f;

    for (i = 0; i < iter; i++)
    {
        // Generating the plane's normals randomly
        // between -1 and 1
        normal[0] = gen(rng);
        normal[1] = gen(rng);
        normal[2] = gen(rng);

        timer.reset();
        d_pos = pos;
   .... 

[FEA] Provide a way to query the initialized state of RMM

Is your feature request related to a problem? Please describe.
There is currently no way to query whether or not RMM has been initialized and if so, what options were used.
Describe the solution you'd like
Provide an API for querying initialization state of RMM, e.g. bool rmm::is_initialized(rmmOptions_t *options), which would return true or false and if true return the options struct filled out.
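
For example, library code could then guard its setup as follows (hypothetical usage of the proposed API):

#include <rmm/rmm.h>

void ensure_rmm_initialized()
{
  rmmOptions_t options{};
  if (!rmm::is_initialized(&options)) {  // proposed API, does not exist today
    rmmInitialize(&options);
  }
  // otherwise reuse the configuration reported back in `options`
}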

Describe alternatives you've considered
I have also considered separating the Boolean state and the options in separate queries, but I think allowing nullptr as a valid value for options satisfies both use cases.

Additional context
This is necessary for interoperation of multiple modules / libraries that all need to use RMM without re-initializing it.

[FEA] Configure RMM CMake to build CUDA files

Is your feature request related to a problem? Please describe.
For some tests, I would like to be able to compile/run kernels and Thrust functions. However, I cannot build any .cu files using RMM's existing cmake configuration.

Describe the solution you'd like
Update RMM's cmake configuration to allow building .cu files.

[BUG] - files placed in the wrong directory as part of RMM installation.

I used make install and it does indeed copy files to a location that I specify, or to the default location (which was /usr/local/ on my system). However, the header files are placed there under include/include/. If I then try to do #include <include/memory.h>, it fails because the other files are not on the include path. It would make sense to either make a directory called rmm within include and ensure that all the header files within rmm also look for files within that directory, or simply not place the files into /usr/local/include/include but rather into /usr/local/include/.

[BUG] rmm.finalize not releasing memory (pool mode)

Describe the bug
Calling rmm.finalize() after rmm has been initialized in pool mode should/used to free up the memory pool. This no longer happens.

Steps/Code to reproduce bug

import rmm
from rmm import rmm_config as rmm_cfg

rmm_cfg.use_pool_allocator = True
rmm.initialize()

The pool is allocated with 1/2 of the GPU memory:

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   38C    P0    27W /  70W |   7669MiB / 15079MiB |      0%      Default |

rmm.finalize()

GPU memory usage is still 1/2 of the GPU memory:

| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   43C    P0    28W /  70W |   7669MiB / 15079MiB |      0%      Default |

Expected behavior
GPU memory should be freed.

Environment details (please complete the following information):

  • Environment location: Conda
  • Method of RMM install: Conda install nightly rmm=0.10.*
    librmm-0.10.0a191007
    rmm-0.10.0a191007
  • Please run and attach the output of the rmm/print_env.sh script to gather relevant environment details


[FEA] Add array shape and order options for device_array_from_ptr

Is your feature request related to a problem? Please describe.

Currently device_array_from_ptr in librmm_cffi/wrapper.py assumes a 1D array. Some cuML algorithms return higher-dimensional arrays, and we need to wrap them as a DeviceNDArray.

Describe the solution you'd like
Add shape and order options to support wrapping multi-dimensional device arrays.

Here is an implementation from cuML SVM:
https://github.com/tfeher/cuml/blob/97d2c00d538a2799db7b42b584b8006aee1633ed/python/cuml/utils/numba_utils.py#L145-L185

Rename RMM's header memory.h

The name of the RMM header memory.h clashes with STL and C standard header names, creating build issues (one harmful consequence is 'extern "C"' causing name-mangling issues when linking). Please consider renaming it to a non-standard header name (e.g., rmm_memory.h).

[FEA] Supporting cuDF Series in device_array_like

Is your feature request related to a problem? Please describe.

I'd like to run the following code.

from librmm_cffi import librmm as rmm
import cudf

s = cudf.Series([0, 1, 2])
a = rmm.device_array_like(s)

Currently this fails with the following error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-917a74c3463e> in <module>
----> 1 rmm.device_array_like(s)

~/miniconda/envs/rapids9/lib/python3.7/site-packages/librmm_cffi/wrapper.py in device_array_like(self, ary, stream)
    227             ary = ary.reshape(1)
    228 
--> 229         return self.device_array(ary.shape, ary.dtype, ary.strides,
    230                                  stream=stream)
    231 

AttributeError: 'Series' object has no attribute 'strides'

Describe the solution you'd like

It would be great if rmm.device_array_like worked with Series objects. No strong feelings about how that is accomplished.

Describe alternatives you've considered

We could special-case the handling of Series objects, but this shifts the burden onto other libraries to solve this problem.

Alternatively cuDF Series objects could gain a strides attribute. This could be reasonable.

Additional context

This came up when trying to better handle GPU array-like objects in cuML ( rapidsai/cuml#1086 ), which is part of the Grid Search effort.

Edit: More specifically, we tried to use librmm_cffi.librmm.to_device instead of numba.cuda.to_device, but were unable to as Series are not supported.

[DOC] - RMM will not capture Out-of-Bound segfaults in pool mode

When using RMM in pool mode, out-of-bounds memory accesses can go undetected (no segfault), because the out-of-bounds access may still fall within the bounds of the pre-allocated memory pool.

To avoid this, it is highly recommended that the non-pool version of RMM be used while developing code, until correctness has been verified; at that point the pool can be enabled to improve performance.
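
For example, a debugging configuration might initialize RMM without the pool (using the options shown in the snippets above) and only switch to PoolAllocation once correctness is verified:

// Development/debugging: plain cudaMalloc-backed allocations, so out-of-bounds
// accesses are not silently absorbed by a large pre-allocated pool.
rmmOptions_t options{rmmAllocationMode_t::CudaDefaultAllocation, 0, false};
rmmInitialize(&options);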
