rapidsai / kvikio
KvikIO - High Performance File IO
Home Page: https://docs.rapids.ai/api/kvikio/stable/
License: Apache License 2.0
Please update the instructions or the build script to make it work on any CUDA toolkit-enabled system.
❯ pyenv activate cucim-3.8
❯ cd python
❯ python -m pip install .
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gbae/repo/kvikio/python
DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Collecting cython
Downloading Cython-0.29.28-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
|████████████████████████████████| 1.9 MB 1.1 MB/s
Building wheels for collected packages: kvikio
Building wheel for kvikio (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /home/gbae/.pyenv/versions/cucim-3.8/bin/python /home/gbae/.pyenv/versions/cucim-3.8/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmp7ibfglvk
cwd: /tmp/pip-req-build-d06p8fxv
Complete output (27 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/kvikio
copying kvikio/cufile.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/zarr.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/thread_pool.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/_version.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/__init__.py -> build/lib.linux-x86_64-3.8/kvikio
creating build/lib.linux-x86_64-3.8/kvikio/_lib
copying kvikio/_lib/__init__.py -> build/lib.linux-x86_64-3.8/kvikio/_lib
copying kvikio/_lib/arr.pyi -> build/lib.linux-x86_64-3.8/kvikio/_lib
UPDATING build/lib.linux-x86_64-3.8/kvikio/_version.py
set build/lib.linux-x86_64-3.8/kvikio/_version.py to '0+unknown'
running build_ext
building 'kvikio._lib.libkvikio' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/kvikio
creating build/temp.linux-x86_64-3.8/kvikio/_lib
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O3 -Wall -I/home/linuxbrew/.linuxbrew/opt/zlib -I/home/linuxbrew/.linuxbrew/opt/zlib -fPIC -O3 -I/home/gbae/.pyenv/versions/3.8.12/include -I/usr/local/cuda/include -I/home/gbae/.pyenv/versions/cucim-3.8/include -I/home/gbae/.pyenv/versions/3.8.12/include/python3.8 -c kvikio/_lib/libkvikio.cpp -o build/temp.linux-x86_64-3.8/kvikio/_lib/libkvikio.o -std=c++17
kvikio/_lib/libkvikio.cpp:752:10: fatal error: kvikio/utils.hpp: No such file or directory
752 | #include <kvikio/utils.hpp>
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
----------------------------------------
ERROR: Failed building wheel for kvikio
Failed to build kvikio
ERROR: Could not build wheels for kvikio which use PEP 517 and cannot be installed directly
WARNING: You are using pip version 21.1.1; however, version 22.0.3 is available.
You should consider upgrading via the '/home/gbae/.pyenv/versions/cucim-3.8/bin/python -m pip install --upgrade pip' command.
❯ cd python
❯ python -m pip install .
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /home/gbae/repo/kvikio/python
DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Collecting cython
Downloading Cython-0.29.28-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
|████████████████████████████████| 1.9 MB 679 kB/s
Building wheels for collected packages: kvikio
Building wheel for kvikio (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /home/gbae/.pyenv/versions/cucim-3.8/bin/python /home/gbae/.pyenv/versions/cucim-3.8/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmpio38_f81
cwd: /tmp/pip-req-build-nnr151e7
Complete output (62 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/kvikio
copying kvikio/cufile.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/zarr.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/thread_pool.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/_version.py -> build/lib.linux-x86_64-3.8/kvikio
copying kvikio/__init__.py -> build/lib.linux-x86_64-3.8/kvikio
creating build/lib.linux-x86_64-3.8/kvikio/_lib
copying kvikio/_lib/__init__.py -> build/lib.linux-x86_64-3.8/kvikio/_lib
copying kvikio/_lib/arr.pyi -> build/lib.linux-x86_64-3.8/kvikio/_lib
UPDATING build/lib.linux-x86_64-3.8/kvikio/_version.py
set build/lib.linux-x86_64-3.8/kvikio/_version.py to '0+unknown'
running build_ext
building 'kvikio._lib.libkvikio' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/kvikio
creating build/temp.linux-x86_64-3.8/kvikio/_lib
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O3 -Wall -I/home/linuxbrew/.linuxbrew/opt/zlib -I/home/linuxbrew/.linuxbrew/opt/zlib -fPIC -O3 -I/home/gbae/.pyenv/versions/3.8.12/include -I/home/gbae/repo/kvikio/cpp/include -I/usr/local/cuda/include -I/home/gbae/.pyenv/versions/cucim-3.8/include -I/home/gbae/.pyenv/versions/3.8.12/include/python3.8 -c kvikio/_lib/libkvikio.cpp -o build/temp.linux-x86_64-3.8/kvikio/_lib/libkvikio.o -std=c++17
kvikio/_lib/libkvikio.cpp:15862:20: warning: ‘__pyx_mdef___pyx_memoryviewslice_3__setstate_cython__’ defined but not used [-Wunused-variable]
15862 | static PyMethodDef __pyx_mdef___pyx_memoryviewslice_3__setstate_cython__ = {"__setstate_cython__", (PyCFunction)__pyx_pw___pyx_memoryviewslice_3__setstate_cython__, METH_O, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:15804:20: warning: ‘__pyx_mdef___pyx_memoryviewslice_1__reduce_cython__’ defined but not used [-Wunused-variable]
15804 | static PyMethodDef __pyx_mdef___pyx_memoryviewslice_1__reduce_cython__ = {"__reduce_cython__", (PyCFunction)__pyx_pw___pyx_memoryviewslice_1__reduce_cython__, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:12959:20: warning: ‘__pyx_mdef___pyx_memoryview_3__setstate_cython__’ defined but not used [-Wunused-variable]
12959 | static PyMethodDef __pyx_mdef___pyx_memoryview_3__setstate_cython__ = {"__setstate_cython__", (PyCFunction)__pyx_pw___pyx_memoryview_3__setstate_cython__, METH_O, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:12901:20: warning: ‘__pyx_mdef___pyx_memoryview_1__reduce_cython__’ defined but not used [-Wunused-variable]
12901 | static PyMethodDef __pyx_mdef___pyx_memoryview_1__reduce_cython__ = {"__reduce_cython__", (PyCFunction)__pyx_pw___pyx_memoryview_1__reduce_cython__, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:12807:20: warning: ‘__pyx_mdef_15View_dot_MemoryView_10memoryview_23copy_fortran’ defined but not used [-Wunused-variable]
12807 | static PyMethodDef __pyx_mdef_15View_dot_MemoryView_10memoryview_23copy_fortran = {"copy_fortran", (PyCFunction)__pyx_memoryview_copy_fortran, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:12712:20: warning: ‘__pyx_mdef_15View_dot_MemoryView_10memoryview_21copy’ defined but not used [-Wunused-variable]
12712 | static PyMethodDef __pyx_mdef_15View_dot_MemoryView_10memoryview_21copy = {"copy", (PyCFunction)__pyx_memoryview_copy, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:12635:20: warning: ‘__pyx_mdef_15View_dot_MemoryView_10memoryview_19is_f_contig’ defined but not used [-Wunused-variable]
12635 | static PyMethodDef __pyx_mdef_15View_dot_MemoryView_10memoryview_19is_f_contig = {"is_f_contig", (PyCFunction)__pyx_memoryview_is_f_contig, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:12558:20: warning: ‘__pyx_mdef_15View_dot_MemoryView_10memoryview_17is_c_contig’ defined but not used [-Wunused-variable]
12558 | static PyMethodDef __pyx_mdef_15View_dot_MemoryView_10memoryview_17is_c_contig = {"is_c_contig", (PyCFunction)__pyx_memoryview_is_c_contig, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:8685:20: warning: ‘__pyx_mdef___pyx_MemviewEnum_3__setstate_cython__’ defined but not used [-Wunused-variable]
8685 | static PyMethodDef __pyx_mdef___pyx_MemviewEnum_3__setstate_cython__ = {"__setstate_cython__", (PyCFunction)__pyx_pw___pyx_MemviewEnum_3__setstate_cython__, METH_O, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:8449:20: warning: ‘__pyx_mdef___pyx_MemviewEnum_1__reduce_cython__’ defined but not used [-Wunused-variable]
8449 | static PyMethodDef __pyx_mdef___pyx_MemviewEnum_1__reduce_cython__ = {"__reduce_cython__", (PyCFunction)__pyx_pw___pyx_MemviewEnum_1__reduce_cython__, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:8071:20: warning: ‘__pyx_mdef___pyx_array_3__setstate_cython__’ defined but not used [-Wunused-variable]
8071 | static PyMethodDef __pyx_mdef___pyx_array_3__setstate_cython__ = {"__setstate_cython__", (PyCFunction)__pyx_pw___pyx_array_3__setstate_cython__, METH_O, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kvikio/_lib/libkvikio.cpp:8013:20: warning: ‘__pyx_mdef___pyx_array_1__reduce_cython__’ defined but not used [-Wunused-variable]
8013 | static PyMethodDef __pyx_mdef___pyx_array_1__reduce_cython__ = {"__reduce_cython__", (PyCFunction)__pyx_pw___pyx_array_1__reduce_cython__, METH_NOARGS, 0};
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
g++ -pthread -shared -L/home/linuxbrew/.linuxbrew/opt/readline/lib -L/home/gbae/.pyenv/versions/3.8.12/lib -L/home/linuxbrew/.linuxbrew/lib -L/home/linuxbrew/.linuxbrew/opt/readline/lib -L/home/gbae/.pyenv/versions/3.8.12/lib -L/home/linuxbrew/.linuxbrew/lib build/temp.linux-x86_64-3.8/kvikio/_lib/libkvikio.o -L/home/gbae/.pyenv/versions/cucim-3.8/lib/python3.8/site-packages -L/usr/local/cuda/lib64 -lcuda -lnvidia-ml -lcufile -o build/lib.linux-x86_64-3.8/kvikio/_lib/libkvikio.cpython-38-x86_64-linux-gnu.so
/home/linuxbrew/.linuxbrew/bin/ld: cannot find -lnvidia-ml
collect2: error: ld returned 1 exit status
error: command '/usr/bin/g++' failed with exit code 1
----------------------------------------
ERROR: Failed building wheel for kvikio
I worked around the issue with the following change.
--- a/python/setup.py
+++ b/python/setup.py
@@ -45,8 +45,8 @@ this_setup_scrip_dir = os.path.dirname(os.path.realpath(__file__))
# Throughout the script, we will populate `include_dirs`,
# `library_dirs` and `depends`.
-include_dirs = [os.path.dirname(sysconfig.get_path("include"))]
-library_dirs = [get_python_lib()]
+include_dirs = [os.path.dirname(sysconfig.get_path("include"))] + ["/home/gbae/repo/kvikio/cpp/include"]
+library_dirs = [get_python_lib(), "/usr/local/cuda/lib64/stubs"]
extra_objects = []
depends = [] # Files to trigger rebuild when modified
Currently there are a few metadata keys that get handled specially in the GDSStore
(see below).
Lines 35 to 39 in 0251229
Though one that isn't handled currently is the consolidated metadata key (.zmetadata), which can show up for some Zarr stores.
Would be good to filter this one out too
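One way to sketch such a filter (the helper name and the exact set of special keys are assumptions for illustration, not GDSStore's actual implementation):

```python
# Hypothetical sketch: treat Zarr's consolidated-metadata key ".zmetadata"
# as a special key, alongside the other metadata keys GDSStore already
# handles specially.
SPECIAL_KEYS = {".zarray", ".zgroup", ".zattrs", ".zmetadata"}

def is_metadata_key(key: str) -> bool:
    """True for keys that should bypass the GDS fast path."""
    # Only the final path component matters, e.g. "group/.zarray"
    return key.rsplit("/", 1)[-1] in SPECIAL_KEYS
```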
Currently nvCOMP is fetched by CMake
kvikio/python/cmake/CMakeLists.txt
Line 15 in d9eee8b
Now that there is an nvcomp conda package, it would be good to switch over to using that instead
The size of chunks in the pinned memory pool should be limited by the slice size, since each chunk is always used by a single thread.
For users getting started, it would be nice to have either an example in a tutorial or a notebook showing how to plug KvikIO into Zarr.
Hi Team,
Is there any out-of-the-box solution for object reads from S3 over any known SDK -- boto or Azure -- for cuFile?
I could not find any ready solution for this hence decided to open a query here....
I am currently working on an infra project to test the performance of storage devices -- and a part of that work involves object reads from S3.
I have tried downloading the S3 object and then go for Cufile reads, however, it seems somewhat like taking a detour to reach the destination -- and I am mostly concerned with the performance implications over the extra code!
Therefore I wanted to know if there are any solutions provided in kvikIO for such a use case? If not, maybe you could suggest an alternative to approaching the problem without going through the download & read path?
N.B.: I have also played around with DALI, so open to any suggestions, if it involves DALI as well....
Many thanks in advance.
The nvCOMP portion of the code here currently requires NumPy & CuPy whereas KvikIO does not. It may be worthwhile to relax this constraint in nvCOMP to simplify usage requirements.
Hello!
I tried to install Kvikio using the following code:
conda install -c rapidsai -c conda-forge kvikio
but Anaconda said it couldn't find the package. I am using a Windows 10 system; can anyone help me resolve this problem? Thank you!
Documenting an offline discussion with @jakirkham in preparation of CUDA 12 bring-up on conda-forge.
Currently, there are some places where libXXXXX.so is dlopen'd without any SONAME, for example:
kvikio/cpp/include/kvikio/shim/cufile.hpp
Line 53 in db0a3e7
The libXXXXX.so symlink is supposed to exist only in the libXXXXX-dev package by the standard practice of Linux distros, not in libXXXXX, which only contains libXXXXX.so.{major}. The dlopen'd names need to be canonicalized.
Note: Text borrowed from Leo's issue ( rapidsai/cudf#12708 ) and tweaked
Hello, the steps for reproducing:
gh repo clone rapidsai/kvikio
cd kvikio/cpp
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make
./basic_io
KvikIO defaults:
Compatibility mode: disabled
DriverProperties:
Version: 0.0
Allow compatibility mode: true
Pool mode - enabled: false, threshold: 4 kb
Max pinned memory: 4294967292 kb
terminate called after throwing an instance of 'std::system_error'
debugging:
cgdb ./basic_io
...
bt
...
#7 0x00007ffff78a5fcd in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x555555597860 <typeinfo for std::system_error@GLIBCXX_3.4.11>, dest=0x7ffff78d5fd0 <std::system_error::~system_error()>)
at /usr/src/debug/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
#8 0x0000555555564dd4 in kvikio::detail::open_fd (file_path="/tmp/test-file", flags="w", o_direct=true, mode=436) at /home/alsam/work/github/kvikio/cpp/examples/../include/kvikio/file_handle.hpp:85
#9 0x0000555555564f48 in kvikio::FileHandle::FileHandle (this=0x7fffffffe6c0, file_path="/tmp/test-file", flags="w", mode=436, compat_mode=false)
failed to open the file /tmp/test-file
for writing with o_direct=true, mode=436
$ touch /tmp/test-file
$ echo $?
0
Thanks
Create fixtures that allow a range of variation in the nvcomp test values. These values are hard-coded very specifically, and any time a change in C++ nvcomp slightly modifies its compression parameters, the tests here will fail. If we provide fixtures with set values and a utility function for comparison that lets them vary slightly, we'll gain the benefit of tracking their output and temporary data sizes without having to constantly retool for failing CI.
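A minimal sketch of such a tolerance-based comparison helper (the name and default tolerance are illustrative, not the real test values):

```python
def assert_size_close(actual: int, expected: int, rel_tol: float = 0.1) -> None:
    """Pass if `actual` is within `rel_tol` (fractional) of `expected`.

    Lets compressed/temporary data sizes drift slightly across nvcomp
    releases without the test failing outright.
    """
    if abs(actual - expected) > rel_tol * expected:
        raise AssertionError(f"{actual} not within {rel_tol:.0%} of {expected}")
```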
Hi,
I tried the gdsio tool and it works fine as expected: compat mode is slower than GPUDirect reads. But when checking this library using the given Python example, it doesn't behave the same (compat mode is faster than the GPUDirect read). Could you please advise?
Below is a changed version of the example given in README.md:
#!/usr/bin/python
import cupy
import kvikio
import time
import kvikio.defaults
def main(nelem):
    print("Tensor size:", nelem)
    path = "/mnt/tmp/test-file"
    start_time = time.time()
    a = cupy.arange(nelem)
    f = kvikio.CuFile(path, "w")
    # Write whole array to file
    f.write(a)
    f.close()
    print("--- %s seconds write time---" % (time.time() - start_time))
    # Read whole array from file
    start_time = time.time()
    b = cupy.empty_like(a)
    print("--- %s seconds buffer creation---" % (time.time() - start_time))
    f = kvikio.CuFile(path, "r")
    f.read(b)
    print("--- %s seconds buffer creation and load time---" % (time.time() - start_time))
    assert all(a == b)
    # Use context manager
    start_time = time.time()
    c = cupy.empty_like(a)
    with kvikio.CuFile(path, "r") as f:
        f.read(c)
    print("--- %s seconds buffer creation and load time with context manager---" % (time.time() - start_time))
    assert all(a == c)
    # Non-blocking read (integer division so the slice indices are ints)
    start_time = time.time()
    d = cupy.empty_like(a)
    with kvikio.CuFile(path, "r") as f:
        future1 = f.pread(d[:nelem // 2])
        future2 = f.pread(d[nelem // 2:], file_offset=d[:nelem // 2].nbytes)
        future1.get()  # Wait for first read
        future2.get()  # Wait for second read
    print("--- %s seconds buffer creation and load time with non block read---" % (time.time() - start_time))
    start_time = time.time()
    assert all(a == d)
    print("--- %s seconds assertion time---" % (time.time() - start_time))

arr_sizes = [100, 1000000]
kvikio.defaults.compat_mode_reset(False)
assert not kvikio.defaults.compat_mode()
for elem in arr_sizes:
    main(elem)
kvikio.defaults.compat_mode_reset(True)
assert kvikio.defaults.compat_mode()
print("COMPAT MODE..")
for elem in arr_sizes:
    main(elem)
output:
Tensor size: 100
--- 0.36174535751342773 seconds write time---
--- 0.00011444091796875 seconds buffer creation---
--- 0.003509044647216797 seconds buffer creation and load time---
--- 0.000995635986328125 seconds buffer creation and load time with context manager---
--- 0.0020360946655273438 seconds buffer creation and load time with non block read---
--- 0.0019338130950927734 seconds assertion time---
Tensor size: 1000000
--- 0.3805253505706787 seconds write time---
--- 0.0003936290740966797 seconds buffer creation---
--- 0.02455925941467285 seconds buffer creation and load time---
--- 0.045484304428100586 seconds buffer creation and load time with context manager---
--- 0.07375788688659668 seconds buffer creation and load time with non block read---
--- 18.388749361038208 seconds assertion time---
COMPAT MODE..
Tensor size: 100
--- 0.04293179512023926 seconds write time---
--- 9.965896606445312e-05 seconds buffer creation---
--- 0.00044655799865722656 seconds buffer creation and load time---
--- 0.0001728534698486328 seconds buffer creation and load time with context manager---
--- 0.00022649765014648438 seconds buffer creation and load time with non block read---
--- 0.0018930435180664062 seconds assertion time---
Tensor size: 1000000
--- 0.05194258689880371 seconds write time---
--- 1.8596649169921875e-05 seconds buffer creation---
--- 0.002271890640258789 seconds buffer creation and load time---
--- 0.002416372299194336 seconds buffer creation and load time with context manager---
--- 0.0020689964294433594 seconds buffer creation and load time with non block read---
--- 18.475173473358154 seconds assertion time---
I tried the conda create line in the readme but it doesn't work
conda create -n kvikio_env -c rapidsai-nightly -c conda-forge python=3.8 cudatoolkit=11.5 kvikio
Mamba fails with
Encountered problems while solving:
- package kvikio-22.04.00a220318-cuda_11_py38_g4a300f5_27 requires python_abi 3.8.* *_cp38, but none of the providers can be installed
Conda fails with
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
Package libgcc-ng conflicts for:
kvikio -> libgcc-ng[version='>=12']
kvikio -> cudatoolkit[version='>=11,<12.0a0'] -> libgcc-ng[version='>=10.3.0|>=9.4.0|>=9.3.0|>=7.3.0|>=7.5.0|>=5.4.0']
Package python conflicts for:
kvikio -> cupy -> python[version='2.7.*|3.5.*|3.6.*|3.8.*|>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.6,<3.7.0a0|>=3.5,<3.6.0a0|>
=3.5|3.4.*|>=3.6,<4|3.9.*']
python=3.8
Package libstdcxx-ng conflicts for:
kvikio -> libstdcxx-ng[version='>=12']
kvikio -> cudatoolkit[version='>=11,<12.0a0'] -> libstdcxx-ng[version='>=10.3.0|>=9.4.0|>=9.3.0|>=7.3.0|>=7.5.0|>=5.4.0']
Package cudatoolkit conflicts for:
cudatoolkit=11.5
kvikio -> cupy -> cudatoolkit[version='10.0|10.0.*|10.1|10.1.*|10.2|10.2.*|11.0|11.0.*|11.4|11.4.*|>=11.2,<12|11.1|11.1.*|9.2|9.2.*|>=11.2,<12.0a0|>=10.0.130,<10.1.0a0
|>=9.2,<9.3.0a0|>=9.0,<9.1.0a0|>=8.0,<9.0a0|8.0.*|9.0.*|>=11.5,<12']
kvikio -> cudatoolkit[version='>=11,<12.0a0']
Package _openmp_mutex conflicts for:
kvikio -> libgcc-ng[version='>=12'] -> _openmp_mutex[version='>=4.5']
python=3.8 -> libgcc-ng[version='>=10.3.0'] -> _openmp_mutex[version='>=4.5']
cudatoolkit=11.5 -> libgcc-ng[version='>=10.3.0'] -> _openmp_mutex[version='>=4.5']
The following specifications were found to be incompatible with your system:
- feature:/linux-64::__glibc==2.17=0
- feature:|@/linux-64::__glibc==2.17=0
- kvikio -> __glibc[version='>=2.17,<3.0.a0']
- kvikio -> cupy -> __glibc[version='>=2.17']
- python=3.8 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
Your installed version is: 2.17
Note that strict channel priority may have removed packages required for satisfiability.
We should create Conda packages, both nightly and release builds.
@quasiben and @jakirkham I hope this is something you can help with? :)
Hi all,
it would be helpful if IOFuture had a .done() member like concurrent.futures, i.e. a routine that checks whether the IO is complete. This would help when iterating through lists/queues of outstanding requests and working on those which are done first.
for handle in iofuture_list:
    if not handle.done():
        continue
    # do whatever
Best Regards
Thorsten
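The requested semantics already exist in the standard library's concurrent.futures.Future.done(); a minimal runnable sketch of the polling pattern (ThreadPoolExecutor stands in for KvikIO's thread pool here, which is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Poll a list of futures and process whichever completes first -- the
# pattern a hypothetical IOFuture.done() would enable.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(lambda i=i: i * 2) for i in range(4)]
    results = []
    while futures:
        for f in list(futures):
            if f.done():            # non-blocking completion check
                results.append(f.result())
                futures.remove(f)
        time.sleep(0.001)           # avoid a hot spin loop
```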
Hi,
I was wondering if we could use the kvikIO Cufile for RDMA IO ops?
I am working on a project that involves Client space Read/Write from our own distributed filesystem (RDMA capable).
Now that the system read/write has been established, I am looking to add some further implementation that involves direct IO through the GPU.
I am using DGX2 which runs my client application.
So, is there any way I could use the kvikIO implementation for the above use case?
If yes, could you guide me with an example code implementation, which could help in this regard?
We use this command to install kvikio as follows:
conda create -n kvikio_env -c rapidsai-nightly -c conda-forge python=3.8 cudatoolkit=11.5 kvikio
But when I run the test code, an error occurs:
Some environmental configuration information is as follows.
RuntimeError: libcuda.so: cannot open shared object file: No such file or directory
But we can find the file in:
/usr/local/cuda-11.7/targets/x86_64-linux/lib/stubs/
My Cuda version is 11.7 and my GPU is P100.
Linux kernel is 5.4.0-70, gcc and g++ both version 11.
And the environment is:
export PATH=$PATH:/root/anaconda3/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.7/lib64
export PATH=$PATH:/usr/local/cuda-11.7/bin
export CUDA_HOME=$CUDA_HOME:/usr/local/cuda-11.7
export CUFILE_PATH="/usr/local/cuda-11.7/targets/x86_64-linux/lib/"
export DALI_EXTRA_PATH="/mnt/optane/wjtest/DALI_extra/"
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/root/anaconda3/etc/profile.d/conda.sh" ]; then
. "/root/anaconda3/etc/profile.d/conda.sh"
else
export PATH="/root/anaconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
source activate
As synchronization might be quite expensive, one common strategy a user might want to employ is adding a callback to their IOFutures. This could be done using an API like concurrent.futures.Future.add_done_callback, i.e. call the callback with the IOFuture as an argument. This way the result or exception could be obtained and operated on if needed without blocking.
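A sketch of the proposed semantics, using the stdlib Future as a stand-in for IOFuture (KvikIO does not provide add_done_callback today; the names below involving the I/O layer are assumptions):

```python
from concurrent.futures import Future

results = []

def on_complete(fut):
    # Runs once the (simulated) I/O finishes; may inspect the result or
    # the exception without ever having blocked on the future.
    results.append(fut.result())

fut = Future()
fut.add_done_callback(on_complete)
fut.set_result(42)  # in real use, the I/O layer would complete the future
```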
Hello. I decided to ask for help here, because I'm lost. GDS seems very unfriendly to get working. Do you have any instructions on how to get it working? As developers, how do you get it running? Are you using ancient Ubuntu 20.04? Thanks
Currently one needs to perform each read/write and wait on them individually. Like so
kvikio/python/examples/hello_world.py
Lines 31 to 34 in 7a43759
However one use case is to do a bunch of writes or reads asynchronously. For example ( zarr-developers/zarr-python#1040 ). Doing this would accumulate a list of IOFutures from each operation.
In this case it would be helpful to have an API that is able to wait on that whole list of operations to complete. Perhaps like concurrent.futures.wait.
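A model of the proposed bulk wait, using the stdlib API the issue points to (IOFuture does not currently plug into concurrent.futures.wait; the ThreadPoolExecutor work items are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

with ThreadPoolExecutor() as pool:
    # Submit a batch of operations, accumulating the futures in a list
    futures = [pool.submit(pow, 2, n) for n in range(5)]
    # One call blocks until the whole batch is complete
    done, not_done = wait(futures, return_when=ALL_COMPLETED)
    total = sum(f.result() for f in done)  # 1 + 2 + 4 + 8 + 16
```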
Is there an example of using Zarr IO from C++? Is arbitrary metadata supported? Is there an example of that? Does kvikio even offer Zarr IO? After #82 is integrated, would it be possible to read/write Zarr directly from host memory?
cufile.so might crash when used within a VM in the cloud. KvikIO should detect this and fall back to its own implementation.
Currently if a user tries to write a mixture of host & device frames, this doesn't work as the following error will show up for any host frames
kvikio/python/kvikio/_lib/libkvikio.pyx
Line 86 in 7a43759
This matters as using KvikIO in Distributed & Dask-CUDA will depend on handling some host frames along the way while reading and writing.
Maybe there could be some kind of fallback to handle host memory in these cases?
When trying to force the benchmark to run with GDS enabled I get the following error:
KVIKIO_COMPAT_MODE=0 python benchmarks/single-node-io.py
Traceback (most recent call last):
File "/home/quasiben/Github/kvikio/python/benchmarks/single-node-io.py", line 401, in <module>
main(args)
File "/home/quasiben/Github/kvikio/python/benchmarks/single-node-io.py", line 307, in main
read, write = API[api](args)
File "/home/quasiben/Github/kvikio/python/benchmarks/single-node-io.py", line 28, in run_cufile
kvikio.memory_register(data)
File "/home/quasiben/miniconda3/envs/kvikio_dev/lib/python3.9/site-packages/kvikio/__init__.py", line 13, in memory_register
return libkvikio.memory_register(buf)
File "libkvikio.pyx", line 44, in kvikio._lib.libkvikio.memory_register
RuntimeError: libcufile.so.1: cannot open shared object file: No such file or directory
I just installed CUDA 11.8 and confirmed I have libcufile.so:
/usr/local/cuda/lib64/libcufile.so -> libcufile.so.0
Additionally, during building, kvikIO finds cuFile:
cmake -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX} ${CMAKE_EXTRA_ARGS} ..
-- The CXX compiler identification is GNU 9.5.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/quasiben/miniconda3/envs/kvikio_dev/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found cuFile: /usr/local/cuda/lib64/libcufile.so
-- Configuring done
Should we also search for libcufile.so in addition to libcufile.so.1?
void* lib = load_library("libcufile.so.1");
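A sketch of trying several candidate sonames in order, analogous to what the C++ load_library call could do (the candidate list is an assumption; ctypes is used here only to illustrate the dlopen fallback):

```python
import ctypes

def load_first(names):
    """Return the first library that dlopen succeeds on, else None."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

# Try the versioned soname first, then the unversioned dev symlink
cufile = load_first(["libcufile.so.1", "libcufile.so", "libcufile.so.0"])
# `cufile` is None when no variant is present (e.g. CUDA not installed)
```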
Pushing for CUDA 12 Conda package builds of libraries for RAPIDS 23.04. This issue is for tracking that work.
Requirements (please recheck):
Note: nvcomp is only needed for compression (not IO). Also, pynvml is only needed in the benchmarks (not for library usage).
Our get_nvcomp.cmake should be updated to use rapids_cpm_nvcomp from rapids-cmake. That will keep kvikio from getting out of sync with the rest of RAPIDS and better ensure that it stays up to date. I believe that there have been a lot of relevant bugfixes/improvements (for cudf) to nvcomp in the last couple of releases.
Originally posted by @vyasr in #96 (comment)
I'm requesting the ability to connect a file descriptor, like a socket handle, to KvikIO so the data from that descriptor can be polled from the GPU and loaded directly, without having to reference back to a CPU data handle. Currently, the CuFile submodule of KvikIO does not support file descriptors.
A similar API to this in Python might be the socket module.
Use case:
Data streaming from a socket connection to a waiting process with a fully GPU workload.
Getting inconsistent errors at the end of process lifetime when "KVIKIO" policy is used in cuIO.
No issues in the actual tests.
There are multiple possible outcomes: crash, segfault, success. Sample output:
30: double free or corruption (!prev)
1/3 Test #30: JSON_TEST ........................Subprocess aborted***Exception: 2.68 sec
Repro persists when all cuFile calls outside of kvikIO are removed.
No repro with other versions of CUDA toolkit.
No repro when using libcudf's cuFile path instead of kvikIO.
We can use Kerchunk to GDS-accelerate HDF5 reads like we do in the Legate backend: #222
However, Kerchunk doesn't support writes, so for that we need another approach.
Would it make sense to start using rapids-cython here as in cuDF?
Any other best practices we should adopt here to modernize our scikit-build/CMake files?
Maybe some of these items are worth addressing ( #58 (comment) )
Would be useful to have wheels for KvikIO as well. This could be helpful for users coming from pip (like some Dask-CUDA users). Raising this to track adding them.
Note: There may be some things that need to be fixed as part of this like issue ( #233 )
Meta Issue to track support of new cuFile features.
It would be good to build the C++ example in CI with -DCMAKE_BUILD_TYPE=Debug
in order to detect issues downstream such as rapidsai/cudf#10703.
If we could also run the example (basic_io), it would be great, but that isn't essential.
Meta Issue to track work related to Zarr-KvikIO
Zarr PRs:
KvikIO PRs:
I am running inside a Docker container, specifically one created using rapids-compose. I am able to successfully build both the Python and C++ kvikio libraries, but both of them fail at runtime with a cuFile internal error. For example, using the exact C++ building instructions here appears to succeed, but then running the basic_io executable gives
(rapids) rapids@compose:~/kvikio/cpp/build$ ./examples/basic_io
KvikIO config:
Compatibility mode: disabled
DriverProperties:
Version: 0.0
Allow compatibility mode: true
Pool mode - enabled: false, threshold: 4 kb
Max pinned memory: 33554432 kb
terminate called after throwing an instance of 'kvikio::CUfileException'
what(): cuFile error at: /home/vyasr/local/rapids/kvikio/cpp/examples/../include/kvikio/file_handle.hpp:117: internal error
Aborted (core dumped)
I get the same "cuFile ... internal error" if I try to run pytest on anything in the python/tests directory.
In some cases a user needs to write a list of binary data (like using writelines). It would be useful to have an API like this when working with multiple frames.
Similarly it might be handy to have some kind of API for reading data into multiple buffers. There is not an exactly analogous API in Python, though socket.recvmsg_into is close (so it might be a starting point).
This can be particularly helpful if the GIL is released in the background as there can be one API call that releases the GIL as opposed to a couple in rapid succession.
It seems a bit counterintuitive that there is reset_num_threads and reset_task_size, but compat_mode_reset.
For consistency, I would have expected the last one to be reset_compat_mode. Do you think this is worth changing via a deprecation cycle?
The C++ API consistently has all three functions in a form ending in _reset, but if we did that on the Python side we would have to change two of the existing functions.
Hi,
I'm trying to compress a file using the nvcomp LZ4Compressor but ended up getting a file larger than the original.
Kindly advise what I'm doing wrong.
import os
import cupy as cp
import numpy as np
import kvikio.nvcomp as nvc
import humanize
DTYPE = cp.int32
dtype = cp.dtype(DTYPE)
filename = '/home/arul/Downloads/kjv10.txt'
print('File :', filename, ' Size: ', humanize.naturalsize(os.path.getsize(filename)))
testfile = open(filename).read()
data = np.frombuffer(bytes(testfile, 'utf-8'), dtype=np.int8)
data_gpu = cp.array(data, dtype=DTYPE)
compressor = nvc.LZ4Compressor(data_gpu.dtype)
compressed = compressor.compress(data_gpu)
gpu_comp_file = open('/home/arul/Downloads/kjv10-gpu-compressed.txt.lz4', 'wb')
gpu_comp_file.write(compressed.tobytes())
gpu_comp_file.close()
print('Compressed Size: ', humanize.naturalsize(compressed.size))
del compressor
del compressed
Result:
File : /home/arul/Downloads/kjv10.txt Size: 4.4 MB
Compressed Size: 17.4 MB
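The blow-up likely comes from the dtype conversion rather than the compressor: casting the int8 view of the text to int32 stores every byte in four bytes, quadrupling the input before LZ4 ever sees it. A minimal CPU-only demonstration with numpy:

```python
import numpy as np

# Mimic the snippet above on the CPU.
raw = b"In the beginning God created the heaven and the earth." * 100
data = np.frombuffer(raw, dtype=np.int8)
upcast = data.astype(np.int32)  # what cp.array(data, dtype=cp.int32) does
assert upcast.nbytes == 4 * data.nbytes
# So a ~4.4 MB file becomes ~17.6 MB of int32 input before compression,
# consistent with the 17.4 MB "compressed" size reported above.
# Keeping the buffer as int8/uint8 avoids the inflation.
```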
Before proposing the new encode_batch and decode_batch API (#248) to the numcodecs Codec, I think we should introduce a new context argument similar to Zarr's Context.
This way, we can use the meta_array option to specify the output memory type of encode_batch and decode_batch.
cc @Alexey-Kamenev
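A hedged sketch of where such a context argument could slot in. Context and BatchCodecSketch are illustrative names following Zarr's Context proposal, not an existing numcodecs interface, and the "encoding" here is a plain copy so the allocation pattern stays visible:

```python
import numpy as np

class Context(dict):
    """Minimal stand-in for Zarr's Context: a mapping carrying meta_array."""

class BatchCodecSketch:
    """Hypothetical codec showing the proposed context argument."""

    def encode_batch(self, bufs, context=None):
        # meta_array is an empty template array; allocating outputs "like" it
        # lets a caller request GPU (e.g. cupy) output without new keywords.
        meta = (context or Context()).get("meta_array", np.empty(0, dtype=np.uint8))
        outs = []
        for buf in bufs:
            data = np.frombuffer(bytes(buf), dtype=np.uint8)
            out = np.empty_like(meta, shape=data.shape, dtype=np.uint8)
            out[...] = data  # a real codec would compress here
            outs.append(out)
        return outs
```

Passing context=Context(meta_array=cupy.empty(0)) would then steer output allocation to the GPU without changing the Codec signature per backend.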
This feature adds support for the nvCOMP batch/low-level API, which allows processing multiple chunks in parallel.
The proposed implementation provides an easy way to use the API via the well-known numcodecs Codec API. Using numcodecs also enables seamless integration with libraries such as zarr that use numcodecs internally.
Additionally, using the nvCOMP batch API enables interoperability between existing codecs and the nvCOMP batch codec. For example, data can be compressed on the CPU using the default LZ4 codec and then decompressed on the GPU using the proposed nvCOMP batch codec.
To support batch mode, the Codec interface was extended with two functions, encode_batch and decode_batch, which implement batch mode.
Note that the current version of zarr does not support chunk-parallel functionality, but there is a proposal for this feature.
Currently, the following compression/decompression algorithms are supported:
nvCOMP also supports other algorithms, which could be added to KvikIO relatively easily.
Example of usage:
import numcodecs
import numpy as np
# Get the codec from the numcodecs registry.
codec = numcodecs.registry.get_codec(dict(id="nvcomp_batch", algorithm="lz4"))
# Create 2 chunks. The chunks do not have to be the same size.
shape = (4, 8)
chunk1, chunk2 = np.random.randn(2, *shape).astype(np.float32)
# Compress data.
data_comp = codec.encode_batch([chunk1, chunk2])
# Decompress.
data_decomp = codec.decode_batch(data_comp)
# Verify.
np.testing.assert_equal(data_decomp[0].view(np.float32).reshape(shape), chunk1)
np.testing.assert_equal(data_decomp[1].view(np.float32).reshape(shape), chunk2)
Usage with zarr (no parallel chunking yet - see the note above):
import numcodecs
import numpy as np
import zarr
# Get the codec from the numcodecs registry.
codec = numcodecs.registry.get_codec(dict(id="nvcomp_batch", algorithm="lz4"))
shape = (16, 16)
chunks = (8, 8)
# Create data and compress.
data = np.random.randn(*shape).astype(np.float32)
z1 = zarr.array(data, chunks=chunks, compressor=codec)
# Store in compressed format.
zarr_store = zarr.MemoryStore()
zarr.save_array(zarr_store, z1, compressor=codec)
# Read back/decompress.
z2 = zarr.open_array(zarr_store)
np.testing.assert_equal(z1[:], z2[:])
If desired, the API can also be used directly, without going through the numcodecs API.
Running the test cases with numpy 1.22.2:
/opt/kvikio/python# pytest tests/
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.0.0
rootdir: /opt/kvikio/python
plugins: typeguard-2.13.3, shard-0.1.2, xdist-3.2.1, hypothesis-5.35.1, rerunfailures-11.1.2, xdoctest-1.0.2
collected 381 items / 1 error / 1 skipped
Running 381 items in this shard
================================================================================================================= ERRORS ==================================================================================================================
__________________________________________________________________________________________________ ERROR collecting tests/test_numpy.py ___________________________________________________________________________________________________
ImportError while importing test module '/opt/kvikio/python/tests/test_numpy.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_numpy.py:7: in <module>
from kvikio.numpy import LikeWrapper, tofile
/usr/local/lib/python3.8/dist-packages/kvikio/numpy.py:10: in <module>
from numpy._typing._array_like import ArrayLike
E ModuleNotFoundError: No module named 'numpy._typing'
cf. numpy._typing was introduced in numpy 1.23.0 (numpy/numpy@7739583).
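A version-tolerant import sketch: preferring the public numpy.typing location (available since numpy 1.20) makes both numpy 1.22 and 1.23+ work, instead of reaching into the private numpy._typing module:

```python
try:
    from numpy.typing import ArrayLike  # public API, numpy >= 1.20
except ImportError:
    ArrayLike = object  # typing-only fallback for very old numpy
```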
Currently we have API docs for CuFile, but not for the compression components. It would be good to include something about these.
cc @thomcom
I am trying to statically link cuFile (libcufile_static.a), but it fails when calling the CUDA API. Am I doing something stupid?
mylib.cpp
#include <iostream>
#include <cassert>
#include <nvml.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
using namespace std;
void test() {
  cout << "test()" << endl;
  nvmlReturn_t e1 = nvmlInit();
  assert(e1 == NVML_SUCCESS);
  cudaError e2 = cudaSetDevice(0);
  assert(e2 == cudaSuccess);
  CUdevice cu_dev{};
  CUresult e3 = cuCtxGetDevice(&cu_dev);
  cout << "cuCtxGetDevice(): " << e3 << endl;
  assert(e3 == CUDA_SUCCESS);
}
myapp.cpp
void test();
int main()
{
  test();
}
build.sh
set -e
CUFILE=/usr/local/cuda/lib64/libcufile_static.a
#CUFILE=
g++ -c -O1 -g -std=gnu++17 -fPIC -I /usr/local/cuda/include mylib.cpp
g++ -shared -std=gnu++17 -O1 -g mylib.o $CUFILE \
-L/usr/local/cuda/lib64 -lcudart -lnvidia-ml -lcuda -o mylib.so
g++ myapp.cpp -std=gnu++17 -O1 -g -Wl,-rpath,/usr/local/cuda/lib64 -Wl,-rpath,. mylib.so -o myapp
./myapp
Placing the three files in a folder and running sh build.sh produces:
$ sh build.sh
test()
cuCtxGetDevice(): 999
myapp: mylib.cpp:21: void test(): Assertion `e3 == CUDA_SUCCESS' failed.
Aborted (core dumped)
Now, if I do dynamic linking instead, by setting CUFILE= and adding -lcufile in the build script, it works:
$ sh build.sh
test()
cuCtxGetDevice(): 0
I'm seeing either a double free or an invalid pointer error every time I complete a benchmark run. Here are the logs from two runs:
/mnt/nvme0/aborkar/kvikio/python/benchmarks$ python3 single-node-io.py -d /mnt/nvme0/aborkar/ -t 24 --nruns 3 2>&1 | tee kvikio_local_nvme.log
Roundtrip benchmark
----------------------------------
GPU | NVIDIA A100-SXM4-80GB (dev #0)
GPU Memory Total | 80.00 GiB
BAR1 Memory Total | 128.00 GiB
GDS driver | v2.13
GDS config.json | /usr/local/cuda-11.8/gds/cufile.json
----------------------------------
nbytes | 10485760 bytes (10.00 MiB)
4K aligned | True
pre-reg-buf | True
diretory | /mnt/nvme0/aborkar
nthreads | 24
nruns | 3
==================================
cufile read | 1.64 GiB/s ± 5.49 % (1.66 GiB/s, 1.72 GiB/s, 1.54 GiB/s)
cufile write | 3.31 GiB/s ± 12.42 % (2.88 GiB/s, 3.34 GiB/s, 3.70 GiB/s)
posix read | 2.32 GiB/s ± 46.37 % (1.12 GiB/s, 2.63 GiB/s, 3.20 GiB/s)
posix write | 0.95 GiB/s ± 13.13 % (824.43 MiB/s, 1.03 GiB/s, 1.01 GiB/s)
double free or corruption (!prev)
/mnt/nvme0/aborkar/kvikio/python/benchmarks$ python3 single-node-io.py -d /mnt/nvme0/aborkar/ -t 8 --nruns 3 2>&1 | tee kvikio_local_nvme.log
Roundtrip benchmark
----------------------------------
GPU | NVIDIA A100-SXM4-80GB (dev #0)
GPU Memory Total | 80.00 GiB
BAR1 Memory Total | 128.00 GiB
GDS driver | v2.13
GDS config.json | /usr/local/cuda-11.8/gds/cufile.json
----------------------------------
nbytes | 10485760 bytes (10.00 MiB)
4K aligned | True
pre-reg-buf | True
diretory | /mnt/nvme0/aborkar
nthreads | 8
nruns | 3
==================================
cufile read | 1.50 GiB/s ± 9.57 % (1.36 GiB/s, 1.48 GiB/s, 1.65 GiB/s)
cufile write | 3.52 GiB/s ± 12.89 % (3.00 GiB/s, 3.84 GiB/s, 3.72 GiB/s)
posix read | 2.84 GiB/s ± 50.86 % (1.17 GiB/s, 3.66 GiB/s, 3.69 GiB/s)
posix write | 0.96 GiB/s ± 15.34 % (814.17 MiB/s, 1.01 GiB/s, 1.08 GiB/s)
free(): invalid pointer
Hi! We've been talking about adding nvcomp to kvikio. I'm looking to add Python bindings for the Snappy, Cascaded, and LZ4 algorithms from nvcomp. In order to do so, we'll need to add the Python bindings nvcomp.pyx and nvcomp.pxd to kvikio/python/kvikio/_lib and a wrapper for them, nvcomp.py. Once this is done I'll write tests.
A CMake flag, -DUSE_NVCOMP=True, will be added, disabled by default.
We're planning on using the nvcomp headers that are installed by cudf, which can be installed via conda, right?
I'm also looking into adding kvikio as another library option for https://github.com/trxcllnt/rapids-compose, which will make maintenance and development quite easy.
GDS doesn't work on WSL, even in compatibility mode.
test.py
import cupy
import kvikio
a = cupy.arange(100)
f = kvikio.CuFile("test-file", "w")
# Write whole array to file
f.write(a)
f.close()
b = cupy.empty_like(a)
f = kvikio.CuFile("test-file", "r")
# Read whole array from file
f.read(b)
assert all(a == b)
❯ python test.py
Assertion failure, file index :cufio-udev line :143
[1] 13856 abort python test.py
FYI, cuCIM handles the issue by checking the platform and skipping the cuFileDriverOpen() call.
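A minimal sketch of that kind of platform guard. The function names are illustrative, and the "microsoft in kernel release" heuristic is a common WSL check, not necessarily cuCIM's exact implementation:

```python
import platform

def is_wsl() -> bool:
    """Heuristic: WSL kernels report 'microsoft' in their release string."""
    return "microsoft" in platform.uname().release.lower()

def maybe_open_driver(open_fn):
    """Call the driver-open function only on non-WSL platforms."""
    if is_wsl():
        return None  # skip GDS entirely, as cuCIM does
    return open_fn()
```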
Not critical since KvikIO doesn't have wheels, but scikit-build has bugs with include_package_data, so specifying package_data explicitly (like in the non-Legate setup.py) is safer. That said, I don't see wheels happening for legate-kvikio anytime soon, so it's mostly just to be safe.
Originally posted by @vyasr in #232 (comment)
Currently this makes use of pynvml in a few places:
kvikio/python/benchmarks/single-node-io.py
Lines 248 to 250 in c152f63
However, we would like to move to nvidia-ml-py in the future. Raising this issue to track that work.
The overhead of KvikIO becomes significant for small reads and writes.
In cuDF we had to skip KvikIO when reading and writing small buffers; see rapidsai/cudf#12780 and rapidsai/cudf#12841.
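A hedged sketch of that cuDF-style workaround: route small transfers to a plain POSIX path and only pay KvikIO's setup cost above a threshold. The threshold value and names are illustrative, not taken from cuDF or KvikIO:

```python
SMALL_BUFFER_THRESHOLD = 128 * 1024  # bytes; tune per system (assumption)

def choose_io_path(nbytes: int) -> str:
    """Pick 'posix' when per-call overhead would dominate the transfer."""
    return "posix" if nbytes < SMALL_BUFFER_THRESHOLD else "kvikio"
```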