Comments (10)

lissahyacinth avatar lissahyacinth commented on August 15, 2024 1

The issue is due to having multiple contexts.

I've put a quick example here - it's got the same requirements as the main example.
https://gist.github.com/lissahyacinth/9379c3f10a1d8ac816a3889c28d825ef

It can be made to crash by changing

fn main() {
    let x = get_cuda_backend();
    test_asum(&x);
    test_nrm2(&x);
    test_axpy(&x);
    test_dot(&x);
    test_copy(&x);
}

to

fn main() {
    let x = get_cuda_backend();
    test_asum(&x);
    let x = get_cuda_backend();
    test_nrm2(&x);
    test_axpy(&x);
    test_dot(&x);
    test_copy(&x);
}

That is all the info I need to diagnose it as a multiple-context issue. Running the tests sequentially isn't sufficient to prevent the crash, as seen above. Our options are:

  1. Rewrite Coaster's CUDA backend to contain the CUBLAS and CUDNN code, so that the backend can properly utilise drop mechanics and we can remove the global CUDNN/CUBLAS contexts. Currently Coaster is extended with CUBLAS and CUDNN via Coaster-Blas and Coaster-NN, so the backend cannot be edited in that way, AFAIK.
  2. Use some destructor methods. CTOR is my preferred choice for it, but this doesn't fix our tests, because the destructor isn't run at the start/end of each test.
  3. Diagnose the use of multiple CUDA handles. This shouldn't crash according to the docs, but it does.
  4. Accept not being able to have multiple contexts, and rewrite the CUBLAS tests with a custom test runner. This comes at the cost of using an unstable Rust feature and having to move to nightly, as it isn't stabilised, but it means that we can run tests again.
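
For reference, the second `let x = get_cuda_backend();` in the reproduction above shadows the first binding rather than replacing it, so Rust keeps both values alive until the end of `main`. The sketch below shows only that language behaviour. `FakeBackend`, `LIVE` and `live_after_shadowing` are hypothetical stand-ins, not Coaster types, and no CUDA is involved.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many stand-in backends are alive at the same time.
static LIVE: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a CUDA backend; not Coaster's actual type.
struct FakeBackend;

impl FakeBackend {
    fn new() -> Self {
        LIVE.fetch_add(1, Ordering::SeqCst);
        FakeBackend
    }
}

impl Drop for FakeBackend {
    fn drop(&mut self) {
        LIVE.fetch_sub(1, Ordering::SeqCst);
    }
}

// Mirrors the shape of the crashing `main`: create, then shadow.
fn live_after_shadowing() -> usize {
    let _x = FakeBackend::new();
    // Shadowing creates a new binding; the first value is NOT dropped here.
    let _x = FakeBackend::new();
    LIVE.load(Ordering::SeqCst)
    // Both values are dropped when the function returns.
}

fn main() {
    println!("backends alive after shadowing: {}", live_after_shadowing());
    println!("backends alive afterwards: {}", LIVE.load(Ordering::SeqCst));
}
```

If the real backend behaves the same way, two contexts coexist from the second `get_cuda_backend()` call onwards, even though the code runs on a single thread.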

Anything else you can think of, @drahnr?

from juice.

lissahyacinth avatar lissahyacinth commented on August 15, 2024 1

The mix of 1 & 2 will fix the majority of our issues here, but we will still need to run the tests in sequence. This isn't really a big concern; it just means adding serial to the tests that use the CUDA backend.
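
As a rough illustration of running the CUDA tests in sequence on stable Rust, one possibility is a process-wide lock that every GPU-touching test takes first. This is a std-only sketch and not necessarily the mechanism meant by "adding serial" above. `CUDA_TEST_LOCK`, `run_serialised` and `max_overlap` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// One lock for the whole test binary (assumption: all CUDA tests share it).
static CUDA_TEST_LOCK: Mutex<()> = Mutex::new(());

// Runs `body` while holding the lock, so only one body executes at a time.
fn run_serialised<T>(body: impl FnOnce() -> T) -> T {
    // A panicking test poisons the mutex; recover so later tests still run.
    let _guard = CUDA_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    body()
}

// Spawns `threads` workers and reports the highest number of bodies
// that were ever inside `run_serialised` at the same moment.
fn max_overlap(threads: usize) -> usize {
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let active = Arc::clone(&active);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                run_serialised(|| {
                    let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    thread::yield_now();
                    active.fetch_sub(1, Ordering::SeqCst);
                })
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    println!("peak concurrent bodies: {}", max_overlap(8));
}
```

In a real test file each `#[test]` would wrap its body in `run_serialised`, which leaves non-CUDA tests free to run in parallel.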

Would you have time to pull it into a microcrate?

drahnr avatar drahnr commented on August 15, 2024

@lissahyacinth are these related to #88 ?

paulkass avatar paulkass commented on August 15, 2024

what version of CUDA did you use?

drahnr avatar drahnr commented on August 15, 2024

I can't say right now; it is whatever the latest available on Fedora RPM Fusion is.

paulkass avatar paulkass commented on August 15, 2024

lissahyacinth avatar lissahyacinth commented on August 15, 2024

Been looking into this today.

CUDA versioning isn't relevant: I can replicate on both 9 and 10.2 (using AWS for 9 and a local machine for 10.2).

The errors are to do with running multiple CUBLAS functions concurrently rather than sequentially. You can remove the error entirely by running

cargo test -- --test-threads 1

Technically, CUDA shouldn't have an issue with multiple contexts; it should just delay the work and run it sequentially to prevent issues. When I looked at the errors, they centred on the context not being available:

========= Program hit cudaErrorContextIsDestroyed (error 709) due to "context is destroyed" on CUDA API call to cudaEventQuery.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuProfilerStop + 0x118a92) [0x2ef442]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cublas64_10.dll (cublasGemmStridedBatchedEx + 0x79d3) [0x299cb3]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cublas64_10.dll [0x1d5d]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cublas64_10.dll [0x338e]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cublas64_10.dll (cublasSetLoggerCallback + 0xbb1) [0x40c21]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (rcublas::API::ffi_snrm2 + 0x74) [0x60264]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (rcublas::API::nrm2 + 0xaf) [0x601df]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (rcublas::api::context::Context::nrm2 + 0x75) [0x5f985]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (coaster_blas::frameworks::cuda::{{impl}}::nrm2 + 0x363) [0x58723]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (blas_specs::test_nrm2<f32,coaster::frameworks::cuda::Cuda> + 0x70) [0x5870]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (blas_specs::cuda_f32::it_computes_correct_nrm2 + 0x18) [0x38c8]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (blas_specs::cuda_f32::it_computes_correct_nrm2::{{closure}} + 0xe) [0x379e]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (core::ops::function::FnOnce::call_once<closure-0,()> + 0x1b) [0x6d7b]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (alloc::boxed::{{impl}}::call_once<(),FnOnce<()>> + 0x57) [0x17417]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (panic_unwind::__rust_maybe_catch_panic + 0x22) [0xa9292]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (test::run_test::run_test_inner::{{closure}} + 0x420) [0x35dd0]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (std::sys_common::backtrace::__rust_begin_short_backtrace<closure-0,()> + 0x26) [0x8546]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure-0>,()> + 0x26) [0xe726]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (panic_unwind::__rust_maybe_catch_panic + 0x22) [0xa9292]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (core::ops::function::FnOnce::call_once<closure-0,()> + 0xd5) [0xf4b5]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (alloc::boxed::{{impl}}::call_once<(),FnOnce<()>> + 0x57) [0x903c7]
=========     Host Frame:D:\BitBucket\juice\target\debug\deps\blas_specs-97d750f4fe887698.exe (std::sys::windows::thread::{{impl}}::new::thread_start + 0x77) [0xa7dc7]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17bd4]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6ced1]

This could be an issue with multiple contexts in parallel.

drahnr avatar drahnr commented on August 15, 2024

It would be interesting to see when the cublas handle is being dropped (and which one, so one sees if they are all the same or not)
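
One way to get that information might be to instrument the handle wrapper's `Drop` so it reports which handle is released and on which thread. The sketch below is an assumption about how such tracing could look, not the existing rcublas code. `TracedHandle`, `DROP_LOG` and `record_drops` are hypothetical, and the `raw` field stands in for the address of a real cuBLAS handle.

```rust
use std::sync::Mutex;
use std::thread;

// Collects the identity of every handle in the order it was dropped.
static DROP_LOG: Mutex<Vec<usize>> = Mutex::new(Vec::new());

// Hypothetical wrapper; `raw` stands in for the handle's address.
struct TracedHandle {
    raw: usize,
}

impl Drop for TracedHandle {
    fn drop(&mut self) {
        // Report which handle goes away and on which thread.
        eprintln!(
            "dropping handle {:#x} on {:?}",
            self.raw,
            thread::current().id()
        );
        DROP_LOG.lock().unwrap().push(self.raw);
    }
}

// Creates two handles in a scope and returns the recorded drop order.
fn record_drops() -> Vec<usize> {
    DROP_LOG.lock().unwrap().clear();
    {
        let _first = TracedHandle { raw: 0x1000 };
        let _second = TracedHandle { raw: 0x2000 };
        // Locals are dropped in reverse declaration order at scope end.
    }
    let log = DROP_LOG.lock().unwrap().clone();
    log
}

fn main() {
    println!("drop order: {:x?}", record_drops());
}
```

Comparing the logged identities would show whether every test sees the same handle or a fresh one, and whether a handle is dropped while another thread is still using it.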

drahnr avatar drahnr commented on August 15, 2024

  1. Would it help if we extracted the CUDA backend initialization into a common micro-crate, so it only gets initialized once across all of coaster, greenglas, juice and the components of coaster?

  2. I'd prefer to use this combined with no. 1.

  3. This sounds very time-intensive, and we can't be sure that NVIDIA will care at all if we find an issue, no matter whether it is an issue with the documentation or an actual bug.

  4. Moving to nightly is not an option at this point; it introduces too much friction for very little benefit.
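
To make the "initialized once" idea from point 1 concrete, here is a minimal sketch of what such a micro-crate could expose, assuming the shared state can be stored in a `static` (it must be `Sync`, which real CUDA handles may not be without extra wrapping). `SharedCuda`, `shared_cuda`, `INIT_CALLS` and `init_count_after` are hypothetical names, and the initializer does no real CUDA work.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

// Counts how often the (stand-in) initialization actually runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical shared state; a real crate would hold the CUDA context here.
struct SharedCuda {
    device: u32,
}

// Every caller, from any crate or thread, gets the same instance.
fn shared_cuda() -> &'static SharedCuda {
    static INSTANCE: OnceLock<SharedCuda> = OnceLock::new();
    INSTANCE.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        SharedCuda { device: 0 }
    })
}

// Requests the shared instance from several threads and returns how
// many times the initializer ran in total.
fn init_count_after(threads: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(|| shared_cuda().device))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    INIT_CALLS.load(Ordering::SeqCst)
}

fn main() {
    println!("initializer ran {} time(s)", init_count_after(4));
    println!("device index: {}", shared_cuda().device);
}
```

`OnceLock` requires Rust 1.70 or newer. A value stored in a `static` is never dropped, so cleanup at process exit would still need a separate answer, such as the destructor approach from point 2.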

drahnr avatar drahnr commented on August 15, 2024

It seems I can probably do it next weekend or early next week. Serial execution can easily be achieved with Concourse: it's literally a matter of changing one parameter on a bigger scale and setting -j1 for the test execution part :)
