
Comments (7)

brandon-b-miller commented on August 28, 2024

Hi @blue-cat-whale,
Can you run the following code in a Python interpreter and paste the output?

from ctypes import c_int, byref
from numba import cuda
dv = c_int(0)
cuda.cudadrv.driver.driver.cuDriverGetVersion(byref(dv))
drv_major = dv.value // 1000
drv_minor = (dv.value - (drv_major * 1000)) // 10
run_major, run_minor = cuda.runtime.get_version()
print(f'{drv_major} {drv_minor} {run_major} {run_minor}')
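
For context on the arithmetic in the snippet above: the CUDA driver reports its version as a single integer encoded as 1000 * major + 10 * minor, so 12010 corresponds to 12.1. A minimal sketch of that decoding follows; `decode_cuda_version` is a hypothetical helper name used only for illustration, and it needs no GPU to run.

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


print(decode_cuda_version(12000))  # driver value matching "12 0"
print(decode_cuda_version(12010))  # value matching "12 1"
```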

from cudf.

blue-cat-whale commented on August 28, 2024

(cudf) [root@localhost nn]# python3
Python 3.11.5 (main, Sep 22 2023, 15:34:29) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctypes import c_int, byref
>>> from numba import cuda
>>> dv = c_int(0)
>>> cuda.cudadrv.driver.driver.cuDriverGetVersion(byref(dv))
>>> drv_major = dv.value // 1000
>>> drv_minor = (dv.value - (drv_major * 1000)) // 10
>>> run_major, run_minor = cuda.runtime.get_version()
>>> print(f'{drv_major} {drv_minor} {run_major} {run_minor}')
12 0 12 1
>>>


brandon-b-miller commented on August 28, 2024

Thanks, this is helpful. Something strange is happening while Numba checks the CUDA versions on your system. I would have expected the above command to show more useful output, but it looks like we need to dig a bit deeper to reproduce the issue.

Would you be able to run this small Python script in the failing environment and paste the output?

import sys
import subprocess

NUMBA_CHECK_VERSION_CMD = """\
from ctypes import c_int, byref
from numba import cuda
dv = c_int(0)
cuda.cudadrv.driver.driver.cuDriverGetVersion(byref(dv))
drv_major = dv.value // 1000
drv_minor = (dv.value - (drv_major * 1000)) // 10
run_major, run_minor = cuda.runtime.get_version()
print(f'{drv_major} {drv_minor} {run_major} {run_minor}')
"""

cp = subprocess.run(
    [sys.executable, "-c", NUMBA_CHECK_VERSION_CMD], capture_output=True
)
print(cp.stdout)


blue-cat-whale commented on August 28, 2024

(cudf) [root@localhost test_cuda]# python tmp_0618.py
b'12 0 12 1\n2024-06-19 10:29:28 [DEBUG] After get lib arch_name=cuda lib_name=liborion_client_common.so, version=0, file_path=/root/.orion/lib/cuda4002000/liborion_client_common.so, ret=0.\n2024-06-19 10:29:28 [DEBUG] After get lib arch_name=cuda lib_name=cuda, version=0, file_path=/root/.orion/lib/cuda4002000/libcuda.so, ret=0.\n2024-06-19 10:29:28 [DEBUG] Using group resource 7cbca860-2ac2-4407-9c32-e4503043d0e5\n2024-06-19 10:29:28 [DEBUG] System initialization begin.\n2024-06-19 10:29:28 [DEBUG] Getting Orion resource ...\n2024-06-19 10:29:28 [DEBUG] Checking Unix socket at /var/tmp/orion/comm/orion.sock\n2024-06-19 10:29:28 [DEBUG] Requesting resource through /var/tmp/orion/comm/orion.sock\n2024-06-19 10:29:28 [INFO] Using Orion resource (7cbca860-2ac2-4407-9c32-e4503043d0e5) b103dad8-2472-42e9-bf13-0c95a1aa5b49 : 10.20.154.1:9960/1/1/12000/GPU-00000000-0000-000a-149a-0126e8010100,Allocation_id:b103dad8-2472-42e9-bf13-0c95a1aa5b49\n2024-06-19 10:29:28 [DEBUG] Architecture 66 initialization begin.\n2024-06-19 10:29:28 [INFO] \x1b[33mClient get resource list : 10.20.154.1:9960/1/1/12000/GPU-00000000-0000-000a-149a-0126e8010100,Allocation_id:b103dad8-2472-42e9-bf13-0c95a1aa5b49\x1b[0m\n2024-06-19 10:29:28 [INFO] \x1b[33mRPC mode. Because env ORION_ENABLE_LPC is not 1.\x1b[0m\n2024-06-19 10:29:28 [DEBUG] Skip orionrun initialization.\n2024-06-19 10:29:28 [DEBUG] System initialization is done.\n2024-06-19 10:29:28 [INFO] Releasing Orion resource ...\n'


brandon-b-miller commented on August 28, 2024

Thanks @blue-cat-whale. I'm not sure what the issue is yet, and I'll need a little time to dig into this. Until I have a better answer, can you try setting the following three environment variables as a workaround and let me know whether you're able to import cudf.pandas afterwards?

export PTXCOMPILER_CHECK_NUMBA_CODEGEN_PATCH_NEEDED=0
export PTXCOMPILER_KNOWN_DRIVER_VERSION=12.0
export PTXCOMPILER_KNOWN_RUNTIME_VERSION=12.1
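
If exporting variables in the shell is inconvenient, the same three values could in principle be set from Python before the import. This is only a sketch: it assumes the variables are read when cudf is imported, which this thread does not confirm, and `apply_ptxcompiler_overrides` is a hypothetical helper name.

```python
import os


def apply_ptxcompiler_overrides(driver: str, runtime: str, environ=os.environ) -> None:
    """Set the three workaround variables from this thread in the given mapping."""
    environ["PTXCOMPILER_CHECK_NUMBA_CODEGEN_PATCH_NEEDED"] = "0"
    environ["PTXCOMPILER_KNOWN_DRIVER_VERSION"] = driver
    environ["PTXCOMPILER_KNOWN_RUNTIME_VERSION"] = runtime


# Values taken from the "12 0 12 1" output reported earlier in the thread.
apply_ptxcompiler_overrides("12.0", "12.1")
# Assumption: the import must come after the variables are set.
# import cudf.pandas
```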


blue-cat-whale commented on August 28, 2024

(cudf) [wangyu@localhost test_cuda]$ python tmp_0618.py
b'12 0 12 1\n2024-06-19 10:31:35 [DEBUG] After get lib arch_name=cuda lib_name=liborion_client_common.so, version=0, file_path=/home/wangyu/.orion/lib/cuda4002000/liborion_client_common.so, ret=0.\n2024-06-19 10:31:35 [DEBUG] After get lib arch_name=cuda lib_name=cuda, version=0, file_path=/home/wangyu/.orion/lib/cuda4002000/libcuda.so, ret=0.\n2024-06-19 10:31:35 [DEBUG] Using group resource fb5f4dab-87de-4105-b01d-fd1d10a9cc69\n2024-06-19 10:31:35 [DEBUG] System initialization begin.\n2024-06-19 10:31:35 [DEBUG] Getting Orion resource ...\n2024-06-19 10:31:35 [DEBUG] Checking Unix socket at /var/tmp/orion/comm/orion.sock\n2024-06-19 10:31:35 [DEBUG] Requesting resource through /var/tmp/orion/comm/orion.sock\n2024-06-19 10:31:35 [INFO] Using Orion resource (fb5f4dab-87de-4105-b01d-fd1d10a9cc69) 8040c557-aae9-428b-a221-9ed3145a909a : 10.20.154.1:9960/1/1/12000/GPU-00000000-0000-000a-149a-0126e8010100,Allocation_id:8040c557-aae9-428b-a221-9ed3145a909a\n2024-06-19 10:31:35 [DEBUG] Architecture 66 initialization begin.\n2024-06-19 10:31:35 [INFO] \x1b[33mClient get resource list : 10.20.154.1:9960/1/1/12000/GPU-00000000-0000-000a-149a-0126e8010100,Allocation_id:8040c557-aae9-428b-a221-9ed3145a909a\x1b[0m\n2024-06-19 10:31:35 [INFO] \x1b[33mRPC mode. Because env ORION_ENABLE_LPC is not 1.\x1b[0m\n2024-06-19 10:31:35 [DEBUG] Skip orionrun initialization.\n2024-06-19 10:31:35 [DEBUG] System initialization is done.\n2024-06-19 10:31:35 [INFO] Releasing Orion resource ...\n'

P.S. I have also updated the output in my previous post. I made a mistake in the old one.


brandon-b-miller commented on August 28, 2024

OK, this makes sense. I think it's fair to treat this as a bug: the way cuDF parses the CUDA versions doesn't account for additional output on stdout and stderr, and that extra output could be trimmed before parsing. I'll put in a PR for this.

As a temporary workaround, I believe the three environment variables above should allow cuDF to import successfully.
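
To illustrate the kind of trimming described above, here is a minimal sketch that picks the version line out of noisy subprocess output. It is not the fix that went into cuDF; `VERSION_LINE` and `parse_versions` are hypothetical names, and the sample bytes are shortened from the output posted in this thread.

```python
import re

# Matches a line made of exactly four whitespace-separated integers,
# e.g. "12 0 12 1" (driver major/minor, runtime major/minor).
VERSION_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_versions(stdout: bytes) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return ((drv_major, drv_minor), (run_major, run_minor)), skipping log lines."""
    for line in stdout.decode(errors="replace").splitlines():
        match = VERSION_LINE.match(line)
        if match:
            drv_major, drv_minor, run_major, run_minor = map(int, match.groups())
            return (drv_major, drv_minor), (run_major, run_minor)
    raise ValueError("no version line found in subprocess output")


noisy = (
    b"12 0 12 1\n"
    b"2024-06-19 10:29:28 [DEBUG] System initialization begin.\n"
    b"2024-06-19 10:29:28 [INFO] Releasing Orion resource ...\n"
)
print(parse_versions(noisy))
```

Matching on the shape of the line, rather than assuming the whole of stdout is the version string, keeps the parse working whether the extra log lines appear before or after the version line.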

