
Comments (5)

mashu commented on June 16, 2024

I did a few more tests:

model           | trains  | CUDA driver | cuDNN                    | kernel driver            | binary builder
diffusion_mnist | no      | 11.8        | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes
diffusion_mnist | no      | 11.8        | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no
vae_mnist       | yes     | 11.8        | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes
vae_mnist       | yes     | 11.8        | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no
vgg_cifar10     | yes/no* | 11.8        | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes
vgg_cifar10     | yes     | 11.8        | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no
conv_mnist      | yes     | 11.8        | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes
conv_mnist      | yes     | 11.8        | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no

Regarding vgg_cifar10 (*): the first time I started a Julia session and updated packages, I got CUDNN_STATUS_INTERNAL_ERROR (code 4), but after restarting the Julia session and trying again the model trains just fine. This is not the case with diffusion_mnist, which never works and always throws CUDNN_STATUS_BAD_PARAM (code 3).

diffusion_mnist also trains fine on the CPU.


mashu commented on June 16, 2024

The model is broken because it uses negative padding, which is not supported on the GPU. MWE below:

using Flux, CUDA

tconv3 = ConvTranspose((3, 3), 64 => 32, stride=2, pad=(0, -1, 0, -1), bias=false) |> gpu
X = randn(Float32, 28, 28, 64, 32) |> gpu   # (W, H, C, N)
tconv3(X)
ERROR: CUDNNError: CUDNN_STATUS_BAD_PARAM (code 3)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
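
For comparison, a control I would expect to work (my own sketch, not from the original report): the same layer with non-negative padding, which isolates the negative pad values as the trigger.

tconv_ok = ConvTranspose((3, 3), 64 => 32, stride=2, pad=0, bias=false) |> gpu
size(tconv_ok(X))   # should run without error; only the negative-padding variant hits BAD_PARAM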

But the CPU version (with X as a plain CPU array) works as the author intended:

[size(ConvTranspose((3, 3), 64 => 32, pad=i)(X)) for i in [1, 0, -1, -2]]
 (28, 28, 32, 32)
 (30, 30, 32, 32)
 (32, 32, 32, 32)
 (34, 34, 32, 32)

[size(Conv((3, 3), 64 => 32, pad=i)(X)) for i in [1, 0, -1, -2]]
 (28, 28, 32, 32)
 (26, 26, 32, 32)
 (24, 24, 32, 32)
 (22, 22, 32, 32)
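
These sizes agree with the standard output-size formulas, so a quick sanity check is possible (my own sketch, assuming stride 1, dilation 1, kernel k = 3, input n = 28, symmetric padding p):

# ConvTranspose: out = (n - 1) + k - 2p = 30 - 2p;  Conv: out = n + 2p - k + 1 = 26 + 2p
[(30 - 2p, 26 + 2p) for p in (1, 0, -1, -2)]
# => [(28, 28), (30, 26), (32, 24), (34, 22)], matching the spatial sizes above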

I don't know whether the original model does the same thing, but I doubt it, because PyTorch does not support negative padding (despite past discussions about supporting it):

torch.nn.Conv2d(64, 32, (3, 3), stride=2, padding=(-1, -1)).to("cuda")(X)
RuntimeError: negative padding is not supported

and negative padding is not supported by NVIDIA cuDNN either:
https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetConvolutionNdDescriptor

CUDNN_STATUS_BAD_PARAM
- One of the elements of padA is strictly negative.


mashu commented on June 16, 2024

Some additional information:

OpenGL renderer string: NVIDIA GeForce RTX 3080/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 510.85.02
OpenGL core profile shading language version string: 4.60 NVIDIA
mateusz@debian:~/model-zoo/vision/diffusion_mnist$ dpkg -l | grep nvidia
ii  glx-alternative-nvidia                     1.2.1                              amd64        allows the selection of NVIDIA as GLX provider
ii  libegl-nvidia0:amd64                       510.85.02-2                        amd64        NVIDIA binary EGL library
ii  libgl1-nvidia-glvnd-glx:amd64              510.85.02-2                        amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
ii  libgles-nvidia1:amd64                      510.85.02-2                        amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                      510.85.02-2                        amd64        NVIDIA binary OpenGL|ES 2.x library
ii  libglx-nvidia0:amd64                       510.85.02-2                        amd64        NVIDIA binary GLX library
ii  libnvidia-allocator1:amd64                 510.85.02-2                        amd64        NVIDIA allocator runtime library
ii  libnvidia-cfg1:amd64                       510.85.02-2                        amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-egl-gbm1:amd64                   1.1.0-1                            amd64        GBM EGL external platform library for NVIDIA
ii  libnvidia-egl-wayland1:amd64               1:1.1.10-1                         amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-eglcore:amd64                    510.85.02-2                        amd64        NVIDIA binary EGL core libraries
ii  libnvidia-encode1:amd64                    510.85.02-2                        amd64        NVENC Video Encoding runtime library
ii  libnvidia-glcore:amd64                     510.85.02-2                        amd64        NVIDIA binary OpenGL/GLX core libraries
ii  libnvidia-glvkspirv:amd64                  510.85.02-2                        amd64        NVIDIA binary Vulkan Spir-V compiler library
ii  libnvidia-ml1:amd64                        510.85.02-2                        amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64            510.85.02-2                        amd64        NVIDIA PTX JIT Compiler library
ii  libnvidia-rtcore:amd64                     510.85.02-2                        amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
ii  nvidia-alternative                         510.85.02-2                        amd64        allows the selection of NVIDIA as GLX provider
ii  nvidia-driver                              510.85.02-2                        amd64        NVIDIA metapackage
ii  nvidia-driver-bin                          510.85.02-2                        amd64        NVIDIA driver support binaries
ii  nvidia-driver-libs:amd64                   510.85.02-2                        amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii  nvidia-egl-common                          510.85.02-2                        amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                       510.85.02-2                        amd64        NVIDIA EGL installable client driver (ICD)
ii  nvidia-installer-cleanup                   20220217+1                         amd64        cleanup after driver installation with the nvidia-installer
ii  nvidia-kernel-common                       20220217+1                         amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                         510.85.02-2                        amd64        NVIDIA binary kernel module DKMS source
ii  nvidia-kernel-support                      510.85.02-2                        amd64        NVIDIA binary kernel module support files
ii  nvidia-legacy-check                        510.85.02-2                        amd64        check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-modprobe                            515.48.07-1                        amd64        utility to load NVIDIA kernel modules and create device nodes
ii  nvidia-persistenced                        470.129.06-1                       amd64        daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-powerd                              510.85.02-1                        amd64        NVIDIA Dynamic Boost (daemon)
ii  nvidia-settings                            510.85.02-1                        amd64        tool for configuring the NVIDIA graphics driver
ii  nvidia-smi                                 510.85.02-2                        amd64        NVIDIA System Management Interface
ii  nvidia-support                             20220217+1                         amd64        NVIDIA binary graphics driver support files
ii  nvidia-vaapi-driver:amd64                  0.0.5-1+b1                         amd64        VA-API implementation that uses NVDEC as a backend
ii  nvidia-vdpau-driver:amd64                  510.85.02-2                        amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common                       510.85.02-2                        amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                    510.85.02-2                        amd64        NVIDIA Vulkan installable client driver (ICD)
ii  xserver-xorg-video-nvidia                  510.85.02-2                        amd64        NVIDIA binary Xorg driver
(diffusion_mnist) pkg> st
Status `~/model-zoo/vision/diffusion_mnist/Project.toml`
  [fbb218c0] BSON v0.3.5
  [052768ef] CUDA v3.12.0
  [0c46a032] DifferentialEquations v7.5.0
  [634d3b9d] DrWatson v2.10.0
  [5789e2e9] FileIO v1.15.0
  [587475ba] Flux v0.13.6
  [916415d5] Images v0.25.2
⌅ [eb30cadb] MLDatasets v0.6.0
  [d96e819e] Parameters v0.12.3
  [91a5bcdd] Plots v1.35.2
  [92933f4c] ProgressMeter v1.7.2
  [899adc3e] TensorBoardLogger v0.1.19
  [56ddb016] Logging
  [9a3f8284] Random
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

(diffusion_mnist) pkg> status --outdated
Status `~/model-zoo/vision/diffusion_mnist/Project.toml`
⌅ [eb30cadb] MLDatasets v0.6.0 (<v0.7.5) [compat]

julia> using CUDA
julia> CUDA.version()
v"11.6.0"
julia> CUDA.CUDNN.version()
v"8.30.2"

julia> using Flux

julia> x = rand(1000) |> gpu
1000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.12327004
 0.77609754
 0.97980183
 ...
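
A plain array transfer works; as a slightly stronger smoke test, one could push a small convolution through cuDNN as well (my own sketch, not from the original report):

c = Conv((3, 3), 1 => 1) |> gpu
size(c(rand(Float32, 8, 8, 1, 1) |> gpu))   # should succeed if cuDNN itself is usable from Julia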


mashu commented on June 16, 2024

Follow-up on this:

  1. NVIDIA drivers
     I removed Debian's 11.6 drivers and upgraded to the 11.8 drivers with matching cuDNN from the NVIDIA website, following their instructions for Debian 11 (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#debian-installation), which boils down to adding their repo and running apt install cuda libcudnn8.
julia> CUDA.versioninfo()
  Downloaded artifact: CUDA
CUDA toolkit 11.7, artifact installation
NVIDIA driver 520.61.5, for CUDA 11.8
CUDA driver 11.8

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+520.61.5
  Downloaded artifact: CUDNN
- CUDNN: 8.30.2 (for CUDA 11.5.0)
  Downloaded artifact: CUTENSOR
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.2
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 3080 (sm_86, 9.198 GiB / 10.000 GiB available)

I was still getting the error and suspected a mismatch between the 11.7 artifacts (the most recent provided by CUDA.jl) and the system's driver, although this shouldn't cause issues.

  2. Disabled CUDA.jl's artifacts
     I disabled the BinaryBuilder artifacts by starting Julia with
JULIA_CUDA_USE_BINARYBUILDER=false julia

Now I am using the system-wide CUDA toolkit and cuDNN libraries, and both match according to

julia> CUDA.versioninfo()
CUDA toolkit 11.8, local installation
NVIDIA driver 520.61.5, for CUDA 11.8
CUDA driver 11.8

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+520.61.5
- CUDNN: 8.60.0 (for CUDA 11.8.0)
- CUTENSOR: 1.6.1 (for CUDA 11.8.0)

Toolchain:
- Julia: 1.8.2
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

1 device:
  0: NVIDIA GeForce RTX 3080 (sm_86, 9.194 GiB / 10.000 GiB available)

Still the same error.

  3. CPU training works, therefore debug cuDNN
     With some help on Slack we concluded that the error is not actually due to an artifacts/API mismatch. Given that the model trains just fine on the CPU, some wrapper in CUDA.jl is the suspect.
     I tried training with
JULIA_DEBUG=CUDNN JULIA_CUDA_USE_BINARYBUILDER=false julia

which produced the following trace:

julia> train()
[ Info: Training on GPU
[ Info: Start Training, total 50 epochs
[ Info: Epoch 1
┌ Debug: CuDNN (v8600) function cudnnGetVersion() called:
│ Time: 2022-10-06T08:51:43.469044 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
┌ Debug: CuDNN (v8600) function cudnnCreateConvolutionDescriptor() called:
│     convDesc: location=host; addr=0x7f0fdf75d340;
│ Time: 2022-10-06T08:51:43.625100 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
ERROR: ┌ Debug: CuDNN (v8600) function cudnnSetConvolutionNdDescriptor() called:
│     convDesc: location=host; addr=0x4b73bbf0;
│     arrayLength: type=int; val=2;
│     padA: type=int; val=[0,0];
│     strideA: type=int; val=[1,1];
│     dilationA: type=int; val=[1,1];
│     mode: type=cudnnConvolutionMode_t; val=CUDNN_CONVOLUTION (0);
│     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
│ Time: 2022-10-06T08:51:43.687233 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
CUDNNError: ┌ Debug: CuDNN (v8600) function cudnnSetConvolutionMathType() called:
│     convDesc: location=host; addr=0x4b73bbf0;
│     mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
│ Time: 2022-10-06T08:51:43.687308 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
CUDNN_STATUS_BAD_PARAM (code 3)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/error.jl:22
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudnn/error.jl:35 [inlined]
  [3] cudnnSetConvolutionNdDescriptor(convDesc::Ptr{Nothing}, arrayLength::Int32, padA::Vector{Int32}, filterStrideA::Vector{Int32}, dilationA::Vector{Int32}, mode::CUDA.CUDNN.cudnnConvolutionMode_t, computeType::CUDA.CUDNN.cudnnDataType_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] cudnnSetConvolutionDescriptor(ptr::Ptr{Nothing}, padding::Vector{Int32}, stride::Vector{Int32}, dilation::Vector{Int32}, mode::CUDA.CUDNN.cudnnConvolutionMode_t, dataType::CUDA.CUDNN.cudnnDataType_t, mathType::CUDA.CUDNN.cudnnMathType_t, reorderType::CUDA.CUDNN.cudnnReorderType_t, groupCount::Int32)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/convolution.jl:135
  [5] CUDA.CUDNN.cudnnConvolutionDescriptor(::Vector{Int32}, ::Vararg{Any})
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/descriptors.jl:39
  [6] CUDA.CUDNN.cudnnConvolutionDescriptor(cdims::DenseConvDims{2, 2, 2, 4, 2}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, pad::Tuple{Int64, Int64})
    @ NNlibCUDA ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:48
  [7] cudnnConvolutionDescriptorAndPaddedInput
    @ ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:43 [inlined]
  [8] ∇conv_data!(dx::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, dy::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; alpha::Int64, beta::Int64, algo::Int64)
    @ NNlibCUDA ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:98
  [9] ∇conv_data!
    @ ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:89 [inlined]
 [10] #∇conv_data#198
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:99 [inlined]
 [11] ∇conv_data
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:95 [inlined]
 [12] #rrule#318
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:326 [inlined]
 [13] rrule
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:316 [inlined]
 [14] rrule
    @ ~/.julia/packages/ChainRulesCore/C73ay/src/rules.jl:134 [inlined]
 [15] chain_rrule
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:218 [inlined]
 [16] macro expansion
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0 [inlined]
 [17] _pullback
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:9 [inlined]
 [18] _pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/conv.jl:333 [inlined]
 [19] _pullback(ctx::Zygote.Context{true}, f::ConvTranspose{2, 4, typeof(identity), CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Bool}, args::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [20] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:140 [inlined]
 [21] _pullback(::Zygote.Context{true}, ::UNet, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [22] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:185 [inlined]
 [23] _pullback(::Zygote.Context{true}, ::typeof(model_loss), ::UNet, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [24] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:176 [inlined]
 [25] _pullback(::Zygote.Context{true}, ::typeof(model_loss), ::UNet, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [26] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:265 [inlined]
 [27] _pullback(::Zygote.Context{true}, ::var"#12#14"{UNet})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [28] pullback(f::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:373
 [29] withgradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:123
 [30] train(; kws::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Main ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:264
 [31] train()
    @ Main ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:222
 [32] top-level scope
    @ REPL[3]:1
 [33] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52


ToucheSir commented on June 16, 2024

If you're able to narrow it down to the particular conv layer which is throwing the error, we can try to replicate those conditions in an MWE.
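
In case it helps with narrowing it down, here is a rough sketch (my own, assuming Flux's Conv/ConvTranspose layers expose their pad field and that Flux.modules walks the whole model) for listing layers that use negative padding:

using Flux

# Return every conv-style layer in `model` whose padding has a negative entry.
negative_pad_layers(model) =
    [l for l in Flux.modules(model) if l isa Union{Conv, ConvTranspose} && any(<(0), l.pad)]

# Usage (hypothetical): negative_pad_layers(unet), where unet is the UNet from diffusion_mnist.jl;
# on the MWE above it would flag the ConvTranspose with pad = (0, -1, 0, -1).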

