
nnlib.jl's Introduction

NNlib.jl


This package provides a library of functions useful for neural networks, such as softmax, sigmoid, batched multiplication, convolutions and pooling. Many of these are used by Flux.jl, which loads this package, but they may be used independently.

For use with automatic differentiation, this package defines gradients using ChainRules.jl. These will be seen by various packages including Zygote.jl.

GPU support is provided as package extensions (see the ext/ folder). In order to load the extensions, use the imports

using NNlib, CUDA, cuDNN

for CUDA support, or

using NNlib, AMDGPU

for AMDGPU support.

nnlib.jl's People

Contributors

astupidbear, avik-pal, ayush1999, carlolucibello, chengchingwen, darsnack, devmotion, dhairyalgandhi, gabrielpreviato, ianbutterworth, jumerckx, maleadt, matsueushi, maxfreu, mcabbott, mikeinnes, nikopj, norci, pxl-th, skyleaworlder, staticfloat, tejank10, theabhirath, thebhatman, touchesir, vincentmolin, viralbshah, wsmoses, yuehhua, zsoerenm

nnlib.jl's Issues

Add Documenter to NNlib

It would be awesome for NNlib to have auto-generated documentation like many other wonderful Julia packages (including Flux.jl). I know you can always look at source, but documentation is really nice sometimes!

I would love to take a stab at integrating Documenter into NNlib as part of Hacktoberfest, if folks are willing to be patient with me and help out when I get stuck! I'll start working on a local branch for this and see how it goes.

Do most Julia packages host documentation on Github Pages?

`softmax` and `logsoftmax` fail (too silently) for integer arrays

Ideally, either

julia> using NNlib

julia> xs = rand(1:10,5)
5-element Array{Int64,1}:
 4
 8
 2
 8
 2

julia> softmax(xs)

Error thrown in threaded loop on thread 0: InexactError()5-element Array{Int64,1}:
               1
 140422637785272
 140422637830184
 140422637778824
              10

julia> logsoftmax(xs)

Error thrown in threaded loop on thread 0: InexactError()5-element Array{Int64,1}:
 140422640438016
 140422640438080
 140422657585728
 140422640435408
 140422195557936

should work properly, with types being promoted as needed. At a minimum, a "stronger" error that terminates program execution should be thrown in this case. At the moment it is still possible to continue program execution with the wrong results, e.g. sum(softmax(xs)), which is not good.
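For illustration, a minimal sketch of the promotion behaviour described above (the methods and approach are only an illustration, not the package's actual fix):

softmax(xs::AbstractArray{<:Integer}) = softmax(float.(xs))
logsoftmax(xs::AbstractArray{<:Integer}) = logsoftmax(float.(xs))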

@fix and fused broadcasts

Continued from JuliaGPU/CuArrays.jl#71

The issue

> x = CuArray([2.]);
> @fix x.*log.(x)
ERROR: Broadcast output type Any is not concrete

Using CUDAnative.log explicitly is fine:

> x.*CUDAnative.log.(x)
1-element CuArray{Float64,1}:
 1.38629

If I expand @fix I get

> @macroexpand @fix x.*log.(x)
quote 
    ##702 = (NNlib._cufunc)(log, (NNlib.cudata).((x,))...)
    x .* ##702.(x)
end

Which evaluates to

> f = CUDAnative.log;
> x.*f.(x)
ERROR: Broadcast output type Any is not concrete

But using only f works.

> f.(x)
1-element CuArray{Float64,1}:
 0.693147

It seems to me there is some issue with fused broadcasts and saving an intrinsic in a variable. Could we change the @fix macro to evaluate (NNlib._cufunc)(log, (NNlib.cudata).((x,))...) when running the macro, and return :(x .* CUDAnative.log.(x)) instead?

Use of pointer based gemm! is unsafe

Ideally, it wouldn't be necessary to pass pointers around as in

gemm!('N','N',M,N,K,alpha,pointer(x2,(j-1)*M*K+1),pointer(w,(j-1)*K*N+1),T(0),pointer(y,yidx))

It would be good to test how much slower a version based on views would be. If that is too costly, then it would be best to add some GC.@preserves since, I believe, the current version is one inlining away from segfaulting.
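For instance, a hedged sketch of the GC.@preserve option around the existing call (variable names taken from the snippet above):

GC.@preserve x2 w y begin
    gemm!('N', 'N', M, N, K, alpha,
          pointer(x2, (j-1)*M*K + 1),
          pointer(w,  (j-1)*K*N + 1),
          T(0), pointer(y, yidx))
end

This keeps x2, w and y rooted for the duration of the call, so the pointers cannot be invalidated by the GC even if the surrounding code gets inlined.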

Use better heuristics for NNPACK threadpool selection

Opening this issue to track the implementation of better heuristics for NNPACK

  • #67 implements simple heuristics for selecting the threadpool while using NNPACK. However, we should, in general, have more robust defaults for the heuristics.

  • As suggested by @staticfloat, we should try to tune the selection criteria specifically for the machine the package is installed on.

  • Some simple improvements are to use nnp_convolution_inference instead of nnp_convolution_output when the batch_size is 1.

Error building `NNlib`

(v1.1) pkg> build NNlib
Building NNlib → C:\Users\GZ\.julia\packages\NNlib\HzCsx\deps\build.log
┌ Error: Error building NNlib:
│ ERROR: LoadError: Your platform ("x86_64-w64-mingw32", parsed as "x86_64-w64-mingw32-gcc4-cxx11") is not supported by this package!
│ Stacktrace:
│ [1] error(::String) at .\error.jl:33
│ [2] top-level scope at C:\Users\GZ\.julia\packages\NNlib\HzCsx\deps\build.jl:30
│ [3] include at .\boot.jl:326 [inlined]
│ [4] include_relative(::Module, ::String) at .\loading.jl:1038
│ [5] include(::Module, ::String) at .\sysimg.jl:29
│ [6] include(::String) at .\client.jl:403
│ [7] top-level scope at none:0
│ in expression starting at C:\Users\GZ\.julia\packages\NNlib\HzCsx\deps\build.jl:26
└ @ Pkg.Operations C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1075

How can I fix this issue?

NNPACK convolutions behaving weirdly!

julia> a = Conv((3, 3), 3 => 1, pad = 1);

julia> x = rand(Float32, 3, 3, 3, 1);

julia> NNlib.conv_nnpack(x, a.weight, NNlib.DenseConvDims(x, a.weight, padding = 1)) # Correct Values
3×3×1×1 Array{Float32,4}:
[:, :, 1, 1] =
  0.742445   0.468205   0.796313
 -0.407806  -1.65078    0.833374
 -1.5611    -0.760315  -0.70605 

julia> NNlib.conv_nnpack(x, a.weight, NNlib.DenseConvDims(x, a.weight, padding = 1)) # Incorrect Values
3×3×1×1 Array{Float32,4}:
[:, :, 1, 1] =
 -0.353632  -1.1626    -2.12827 
  0.504185  -0.364751  -0.083856
  1.39476    0.158737   0.358745

The values keep flipping every 2 calls.

cc @staticfloat

Convolution is not inferable

e.g.

x = rand(10, 10, 3, 2)
w = rand(2, 2, 3, 4)
@code_typed conv(x, w, DenseConvDims(x, w)) # AbstractArray{yT, 4} where yT

@staticfloat any idea what might be going wrong here?

no method matching conv(::Array{Float32,4}, ::Array{Float32,4}; stride=(1, 1), pad=(0, 0), dilation=(1, 1))

This simple code is breaking on latest master.

julia> using Flux, NNlib

julia> c = Conv((2,1), 1 => 1, identity)
Conv((2, 1), 1=>1)

julia> x = randn(10,1,1,11);

julia> c(x)
ERROR: MethodError: no method matching conv(::Array{Float32,4}, ::Array{Float32,4}; stride=(1, 1), pad=(0, 0), dilation=(1, 1))
Closest candidates are:
  conv(::AbstractArray{xT,N}, ::AbstractArray{wT,N}, ::ConvDims; kwargs...) where {xT, wT, N} at /Users/tpevny/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198
  conv(::AbstractArray, ::TrackedArray; kw...) at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:402
  conv(::TrackedArray, ::AbstractArray; kw...) at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:403
Stacktrace:

Generalize activation functions to complex input

Hi,

I would find it useful if the activation functions were generalized to handle complex inputs. Many of them are still well-defined in this case; for example, there is no reason for sigmoid to be limited to real values, and it could easily be generalized.

σ(x::Real) = one(x) / (one(x) + exp(-x))
swish(x::Real) = x * σ(x)
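For example, simply relaxing the signatures to accept any Number already handles complex inputs; a minimal sketch (not necessarily the final design the package would want):

σ(z::Number) = one(z) / (one(z) + exp(-z))
swish(z::Number) = z * σ(z)

σ(0.3 + 0.4im)  # now works for complex arguments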

softplus is also mathematically well defined, but someone with more experience than me could comment on how to rewrite it for an efficient implementation:

softplus(x::Real) = ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))

The main advantage for this is that people like me working with complex-valued neural networks (often encountered in physics) could depend on NNlib and get the GPU versions of those functions with no effort.

Would you accept a PR (and a complimentary PR to CuArrays.jl) for that?

pointer(CuArray) is not defined

The demos from Metalhead.jl are broken if one tries to run them on the GPU (xref FluxML/Metalhead.jl#42), e.g.

julia> using CuArrays, Metalhead, Flux
julia> vgg = VGG19() |> gpu
julia> x = cu(rand(Float32, 224, 224, 3, 1));
julia> vgg(x)

The trace points to an exception in NNlib:

ERROR: conversion to pointer not defined for CuArray{Float32,2}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] unsafe_convert(::Type{Ptr{Float32}}, ::CuArray{Float32,2}) at ./pointer.jl:67
 [3] pointer(::CuArray{Float32,2}) at ./abstractarray.jl:934
 [4] macro expansion at /home/julieta/.julia/packages/NNlib/mxWRT/src/impl/conv_im2col.jl:54 [inlined]
 [5] macro expansion at ./gcutils.jl:87 [inlined]
 [6] macro expansion at /home/julieta/.julia/packages/NNlib/mxWRT/src/impl/conv_im2col.jl:53 [inlined]
 [7] #conv_im2col!#231(::CuArray{Float32,2}, ::Float32, ::Float32, ::typeof(NNlib.conv_im2col!), ::CuArray{Float32,5}, ::CuArray{Float32,5}, ::Array{Float32,5}, ::DenseConvDims{3,(3, 3, 1),3,64,(1, 1, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}) at /home/julieta/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190
 [8] conv!(::CuArray{Float32,5}, ::CuArray{Float32,5}, ::Array{Float32,5}, ::DenseConvDims{3,(3, 3, 1),3,64,(1, 1, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}) at /home/julieta/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198
 [9] #conv!#56(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(conv!), ::CuArray{Float32,4}, ::CuArray{Float32,4}, ::Array{Float32,4}, ::DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}) at /home/julieta/.julia/packages/NNlib/mxWRT/src/conv.jl:68
 [10] conv!(::CuArray{Float32,4}, ::CuArray{Float32,4}, ::Array{Float32,4}, ::DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}) at /home/julieta/.julia/packages/NNlib/mxWRT/src/conv.jl:68
 [11] macro expansion at /home/julieta/.julia/packages/NNlib/mxWRT/src/conv.jl:114 [inlined]
 [12] #conv#97(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(conv), ::CuArray{Float32,4}, ::Array{Float32,4}, ::DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}) at /home/julieta/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190
 [13] conv(::CuArray{Float32,4}, ::Array{Float32,4}, ::DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}) at /home/julieta/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198
 [14] (::Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}})(::CuArray{Float32,4}) at /home/julieta/.julia/packages/Flux/dkJUV/src/layers/conv.jl:55
 [15] applychain(::Tuple{Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},getfield(Metalhead, Symbol("##44#45")),Dense{typeof(relu),LinearAlgebra.Adjoint{Float32,Array{Float32,2}},Array{Float32,1}},Dropout{Float32},Dense{typeof(relu),LinearAlgebra.Adjoint{Float32,Array{Float32,2}},Array{Float32,1}},Dropout{Float32},Dense{typeof(identity),LinearAlgebra.Adjoint{Float32,Array{Float32,2}},Array{Float32,1}},typeof(softmax)}, ::CuArray{Float32,4}) at /home/julieta/.julia/packages/Flux/dkJUV/src/layers/basic.jl:31
 [16] (::Chain{Tuple{Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},Conv{2,2,typeof(relu),Array{Float32,4},Array{Float32,1}},MaxPool{2,4},getfield(Metalhead, Symbol("##44#45")),Dense{typeof(relu),LinearAlgebra.Adjoint{Float32,Array{Float32,2}},Array{Float32,1}},Dropout{Float32},Dense{typeof(relu),LinearAlgebra.Adjoint{Float32,Array{Float32,2}},Array{Float32,1}},Dropout{Float32},Dense{typeof(identity),LinearAlgebra.Adjoint{Float32,Array{Float32,2}},Array{Float32,1}},typeof(softmax)}})(::CuArray{Float32,4}) at /home/julieta/.julia/packages/Flux/dkJUV/src/layers/basic.jl:33
 [17] (::VGG19)(::CuArray{Float32,4}) at /home/julieta/.julia/packages/Metalhead/1dQOk/src/vgg19.jl:46
 [18] top-level scope at REPL[40]:1

Which in this example comes from the NNlib kernel trying to obtain a pointer from a CuArray:

col_ptr = pointer(col)

I believe to obtain the pointer of a CuArray one now does

julia> pointer(x.buf)
CUDAdrv.CuPtr{Nothing}(0x00007f09ffc40000)

But I have no idea how to rewrite the code to make this compatible between CPU and GPU arrays.

Register v0.6.0

I'd like to release a new version of CuArrays, but that depends on a new version of NNlib.
Would any maintainer here care to activate @JuliaRegistrator and cut a new release?

generalize softmax

The softmax functions should be generalized to handle reduction across arbitrary dimensions, e.g.:

softmax(x, dims=2)
softmax(x, dims=(1,3))
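A minimal sketch of what such a dims-aware implementation could look like (purely illustrative: shift by the maximum for stability and reduce over the requested dimensions):

function softmax(x::AbstractArray; dims = 1)
    m = maximum(x; dims = dims)
    e = exp.(x .- m)
    e ./ sum(e; dims = dims)
end

softmax(rand(3, 4, 5), dims = (1, 3))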

The relu definition is a bit "scary" for array inputs.

The following seems like an easy mistake to make:

julia> x = randn(5)
5-element Array{Float64,1}:
 -1.0483606281411753
 -0.13524020825581384
 -2.184303973828124
  0.4308162306092796
 -0.673766167758703

julia> relu(x)
5-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0

julia> relu.(x)
5-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.4308162306092796
 0.0

Include Knet in LICENSE file

According to a comment at the top of:

NNlib.jl/src/conv.jl

Lines 2 to 3 in 1794fbb

## Convolutions on CPU. Almost all the code is borrowed from Knet.jl
## For GPU versions see CUDNN.jl

that code originally comes from Knet.jl.
I believe that, by the terms of Knet's MIT license, NNlib.jl needs to acknowledge that in a particular way.
Normally, I believe this is done in the LICENSE.md file.

Separable Convolutions

Separable Convolutions (https://arxiv.org/abs/1610.02357) are really, really nice to have, and a must for fast/realtime image processing. If nobody else does this, I will likely implement an im2col and view-based implementation of these later this summer, but I figured we should track this as they're rapidly becoming a mainstay in efficient (e.g. mobile) computer vision systems.

Does logsoftmax! allow for gradients to be taken in backprop?

When calling NNlib.logsoftmax!(out_array, in_array), where in_array is a TrackedArray and out_array is zeros(size(in_array)), I get the following error message: Error thrown in threaded loop on thread 0: ErrorException("Can't differentiate `setindex!`"). Furthermore, it then throws a MethodError and begins printing out arrays after arrays after arrays:

Error thrown in threaded loop on thread 0: MethodError(f=Float64, args=(Flux.Tracker.TrackedReal{Float64}(data=-1.85569, tracker=Flux.Tracker.Tracked{Float64}(ref=0x00000000, f=Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##310#313")), Tuple{Flux.Tracker.Tracked{Float64}, Flux.Tracker.Tracked{Float64}}}(func=getfield(Flux.Tracker, Symbol("##310#313"))(), args=(Flux.Tracker.Tracked{Float64}(ref=0x00000000, f=Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##310#313")), Tuple{Flux.Tracker.Tracked{Float64}, Flux.Tracker.Tracked{Float64}}}(func=getfield(Flux.Tracker, Symbol("##310#313"))(), args=(Flux.Tracker.Tracked{Float64}(ref=0x00000000, f=Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##334#336")){Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}, Tuple{Int64, Int64}}, Tuple{Flux.Tracker.Tracked{Array{Float64, 2}}, Nothing, Nothing}}(func=getfield(Flux.Tracker, Symbol("##334#336")){Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}, Tuple{Int64, Int64}}(xs=Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}(tracker=Flux.Tracker.Tracked{Array{Float64, 2}}(ref=0x00000000, f=Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("#back#451")){2, getfield(Base.Broadcast, Symbol("##26#28")){getfield(Base.Broadcast, Symbol("##27#29")){typeof(Base.:(+)), getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}}, getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}}, getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}}, typeof(Base.tanh)}, Tuple{Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}, Flux.Tracker.TrackedArray{Float64, 1, Array{Float64, 1}}}}, Tuple{Flux.Tracker.Tracked{Array{Float64, 2}}, Flux.Tracker.Tracked{Array{Float64, 1}}}}(func=getfield(Flux.Tracker, Symbol("#back#451")){2, getfield(Base.Broadcast, Symbol("##26#28")){getfield(Base.Broadcast, Symbol("##27#29")){typeof(Base.:(+)), getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}}, getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}}, getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}}, typeof(Base.tanh)}, Tuple{Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}, Flux.Tracker.TrackedArray{Float64, 1, Array{Float64, 1}}}}(f=getfield(Base.Broadcast, Symbol("##26#28")){getfield(Base.Broadcast, Symbol("##27#29")){typeof(Base.:(+)), getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}}, getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}}, getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}}, typeof(Base.tanh)}(makeargs=getfield(Base.Broadcast, Symbol("##27#29")){typeof(Base.:(+)), getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}}, getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}}, getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, 
Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}}(f=typeof(Base.:(+))(), headargs=getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}}(headargs=getfield(Base.Broadcast, Symbol("##9#10")){getfield(Base.Broadcast, Symbol("##11#12"))}(headargs=getfield(Base.Broadcast, Symbol("##11#12"))())), tailargs=getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}}(tailargs=getfield(Base.Broadcast, Symbol("##13#14")){getfield(Base.Broadcast, Symbol("##15#16"))}(tailargs=getfield(Base.Broadcast, Symbol("##15#16"))())), #3#makeargs=getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}}(#641#makeargs=getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##3#4"))}(#641#makeargs=getfield(Base.Broadcast, Symbol("##3#4"))()))), f=typeof(Base.tanh)()), args=(Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}(tracker=Flux.Tracker.Tracked{Array{Float64, 2}}(ref=0x00000000, f=Flux.Tracker.Call{getfield(Flux.Tracker, Symbol("##424#425")){Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}, Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}}, Tuple{Flux.Tracker.Tracked{Array{Float64, 2}}, Flux.Tracker.Tracked{Array{Float64, 2}}}}(func=getfield(Flux.Tracker, Symbol("##424#425")){Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}, Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}}(a=Flux.Tracker.TrackedArray{Float64, 2, Array{Float64, 2}}(tracker=Flux.Tracker.Tracked{Array{Float64, 2}}(ref=0x00000000, f=Flux.Tracker.Call{Nothing, Tuple{}}(func=nothing, args=()), isleaf=true, grad=Array{Float64, (4, 64)}[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), data=Array{Float64, (4, 64)}[-0.116073, 0.231022,...

The ellipsis at the end is where the "arrays after arrays after arrays" begins. In any case, I'm just wondering whether this is expected given my implementation? If not, I would love to get your thoughts on what could be going wrong.

conv with No Dilation

Hi! I'm wondering if dilation is required for conv layers. When I use the Flux constructor

Conv((3, 3), 1=>32, dilation=0)

I get this error:

ArgumentError: step cannot be zero

Stacktrace:
 [1] steprange_last(::Int64, ::Int64, ::Int64) at ./range.jl:210
 [2] Type at ./range.jl:200 [inlined]
 [3] Type at ./range.jl:251 [inlined]
 [4] _colon at ./range.jl:24 [inlined]
 [5] (::Colon)(::Int64, ::Int64, ::Int64) at ./range.jl:22
 [6] #conv_direct!#209(::Float64, ::Bool, ::typeof(NNlib.conv_direct!), ::Array{Float64,5}, ::Array{Float64,5}, ::Array{Float32,5}, ::DenseConvDims{3,(3, 3, 1),1,32,(1, 1, 1),(0, 0, 0, 0, 0, 0),(0, 0, 1),false}) at /Users/jonathan/.julia/packages/NNlib/mxWRT/src/impl/conv_direct.jl:81
 [7] conv_direct! at /Users/jonathan/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198 [inlined]
 [8] #conv!#91(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(conv!), ::Array{Float64,5}, ::Array{Float64,5}, ::Array{Float32,5}, ::DenseConvDims{3,(3, 3, 1),1,32,(1, 1, 1),(0, 0, 0, 0, 0, 0),(0, 0, 1),false}) at /Users/jonathan/.julia/packages/NNlib/mxWRT/src/conv.jl:97
 [9] conv!(::Array{Float64,5}, ::Array{Float64,5}, ::Array{Float32,5}, ::DenseConvDims{3,(3, 3, 1),1,32,(1, 1, 1),(0, 0, 0, 0, 0, 0),(0, 0, 1),false}) at /Users/jonathan/.julia/packages/NNlib/mxWRT/src/conv.jl:95
 [10] #conv!#56(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(conv!), ::Array{Float64,4}, ::Array{Float64,4}, ::Array{Float32,4}, ::DenseConvDims{2,(3, 3),1,32,(1, 1),(0, 0, 0, 0),(0, 0),false}) at /Users/jonathan/.julia/packages/NNlib/mxWRT/src/conv.jl:68
 [11] conv!(::Array{Float64,4}, ::Array{Float64,4}, ::Array{Float32,4}, ::DenseConvDims{2,(3, 3),1,32,(1, 1),(0, 0, 0, 0),(0, 0),false}) at /Users/jonathan/.julia/packages/NNlib/mxWRT/src/conv.jl:68
...

It seemed like this was an issue with NNlib, so I decided to open the issue here. This is with NNlib version 0.6.0.
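For what it's worth, if the goal is simply an undilated convolution, a dilation of 1 (the Flux default) is what corresponds to "no dilation"; a dilation of 0 is what produces the zero-step range error above:

Conv((3, 3), 1=>32)               # default dilation = 1, i.e. no dilation
Conv((3, 3), 1=>32, dilation = 1) # equivalent, explicit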

Problems with new convolution

Hello,

When I play with the latest version of the convolution operator, I get the following error. An MWE is below.

julia> c = Conv((20,1), 1 => 1, identity)
Conv((20, 1), 1=>1)

julia> x = randn(100,1,1,11);

julia> c(x)
ERROR: DimensionMismatch("new dimensions (81, -18, 20, 1, 1) must be consistent with array size 1620")
Stacktrace:
 [1] (::getfield(Base, Symbol("#throw_dmrsa#193")))(::NTuple{5,Int64}, ::Int64) at ./reshapedarray.jl:41
 [2] reshape at ./reshapedarray.jl:45 [inlined]
 [3] im2col_2d!(::SubArray{Float32,3,Array{Float32,4},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Array{Float32,2}, ::NNlib.ConvDims{(100, 1),(20, 1),1,(1, 1),(0, 0, 0, 0),(1, 1),true}) at /Users/tpevny/.julia/packages/NNlib/UpABH/src/impl/conv.jl:47
 [4] #conv2d!#39(::Float32, ::Function, ::Array{Float32,4}, ::Array{Float32,4}, ::Array{Float32,4}, ::NNlib.ConvDims{(100, 1),(20, 1),1,(1, 1),(0, 0, 0, 0),(1, 1),true}) at /Users/tpevny/.julia/packages/NNlib/UpABH/src/impl/conv.jl:349
 [5] (::getfield(NNlib, Symbol("#kw##conv2d!")))(::NamedTuple{(:alpha,),Tuple{Float32}}, ::typeof(NNlib.conv2d!), ::Array{Float32,4}, ::Array{Float32,4}, ::Array{Float32,4}, ::NNlib.ConvDims{(100, 1),(20, 1),1,(1, 1),(0, 0, 0, 0),(1, 1),true}) at ./none:0
 [6] #conv2d!#40(::Tuple{Int64,Int64}, ::Tuple{Int64,Int64}, ::Tuple{Int64,Int64}, ::Int64, ::Float32, ::Function, ::Array{Float32,4}, ::Array{Float32,4}, ::Array{Float32,4}) at /Users/tpevny/.julia/packages/NNlib/UpABH/src/impl/conv.jl:373
 [7] #conv2d! at ./none:0 [inlined]
 [8] #conv!#68 at /Users/tpevny/.julia/packages/NNlib/UpABH/src/conv.jl:118 [inlined]
 [9] #conv! at ./none:0 [inlined]
 [10] #conv#54(::Nothing, ::Tuple{Int64,Int64}, ::Tuple{Int64,Int64}, ::Tuple{Int64,Int64}, ::Function, ::Array{Float32,4}, ::Array{Float32,4}) at /Users/tpevny/.julia/packages/NNlib/UpABH/src/conv.jl:62
 [11] #conv at ./none:0 [inlined]
 [12] #_forward#502 at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:405 [inlined]
 [13] #_forward at ./none:0 [inlined]
 [14] #track#1 at /Users/tpevny/.julia/dev/Tracker/src/Tracker.jl:51 [inlined]
 [15] #track at ./none:0 [inlined]
 [16] #conv#500 at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:402 [inlined]
 [17] #conv at ./none:0 [inlined]
 [18] Conv at /Users/tpevny/.julia/packages/Flux/zNlBL/src/layers/conv.jl:53 [inlined]
 [19] Conv at /Users/tpevny/.julia/packages/Flux/zNlBL/src/layers/conv.jl:63 [inlined]
 [20] (::Conv{2,typeof(identity),TrackedArray{…,Array{Float32,4}},TrackedArray{…,Array{Float32,1}}})(::Array{Float64,4}) at /Users/tpevny/.julia/packages/Flux/zNlBL/src/layers/conv.jl:66
 [21] top-level scope at none:0

I have NNlib v0.5.0 (I have tried this with master as well). With v0.4.3 this works, but then I cannot calculate the gradient:

julia> Flux.Tracker.gradient(x -> sum(c(x)), x)
ERROR: MethodError: no method matching ∇conv_data(::Array{Float32,4}, ::Array{Float32,4}; size=(100, 1, 1, 11), stride=(1, 1), pad=(0, 0), dilation=(1, 1))
Closest candidates are:
  ∇conv_data(::A<:AbstractArray, ::A<:AbstractArray, ::A<:AbstractArray; pad, stride, dilation, flipkernel) where A<:AbstractArray at /Users/tpevny/.julia/packages/NNlib/x0XUf/src/conv.jl:39 got unsupported keyword argument "size"
  ∇conv_data(::AbstractArray, ::TrackedArray; kw...) at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:412
  ∇conv_data(::TrackedArray, ::AbstractArray; kw...) at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:413
Stacktrace:
 [1] (::getfield(Tracker, Symbol("##503#504")){Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64},Tuple{Symbol,Symbol,Symbol},NamedTuple{(:stride, :pad, :dilation),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64},Tuple{Int64,Int64}}}},TrackedArray{…,Array{Float32,4}},TrackedArray{…,Array{Float32,4}}})(::Array{Float32,4}) at /Users/tpevny/.julia/dev/Tracker/src/lib/array.jl:407
 [2] back_(::Tracker.Call{getfield(Tracker, Symbol("##503#504")){Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64},Tuple{Symbol,Symbol,Symbol},NamedTuple{(:stride, :pad, :dilation),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64},Tuple{Int64,Int64}}}},TrackedArray{…,Array{Float32,4}},TrackedArray{…,Array{Float32,4}}},Tuple{Tracker.Tracked{Array{Float32,4}},Tracker.Tracked{Array{Float32,4}}}}, ::Array{Float32,4}, ::Bool) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:35
 [3] back(::Tracker.Tracked{Array{Float32,4}}, ::Array{Float32,4}, ::Bool) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:58
 [4] #13 at /Users/tpevny/.julia/dev/Tracker/src/back.jl:38 [inlined]
 [5] foreach at ./abstractarray.jl:1867 [inlined]
 [6] back_(::Tracker.Call{getfield(Tracker, Symbol("#back#526")){2,getfield(Base.Broadcast, Symbol("##2#4")){getfield(Base.Broadcast, Symbol("##8#10")){getfield(Base.Broadcast, Symbol("##1#3")),getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##5#6")){getfield(Base.Broadcast, Symbol("##7#9"))}},getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##11#12")){getfield(Base.Broadcast, Symbol("##13#14"))}},getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##15#16")){getfield(Base.Broadcast, Symbol("##17#18"))}},typeof(+)},typeof(identity)},Tuple{TrackedArray{…,Array{Float32,4}},TrackedArray{…,Array{Float32,4}}}},Tuple{Tracker.Tracked{Array{Float32,4}},Tracker.Tracked{Array{Float32,4}}}}, ::Array{Float32,4}, ::Bool) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:38
 [7] back(::Tracker.Tracked{Array{Float32,4}}, ::Array{Float32,4}, ::Bool) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:58
 [8] foreach at /Users/tpevny/.julia/dev/Tracker/src/back.jl:38 [inlined]
 [9] back_(::Tracker.Call{getfield(Tracker, Symbol("##460#461")){TrackedArray{…,Array{Float32,4}}},Tuple{Tracker.Tracked{Array{Float32,4}}}}, ::Float32, ::Bool) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:38
 [10] back(::Tracker.Tracked{Float32}, ::Int64, ::Bool) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:58
 [11] back!(::Tracker.TrackedReal{Float32}) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:77
 [12] gradient_(::Function, ::Array{Float64,4}) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:4
 [13] #gradient#24 at /Users/tpevny/.julia/dev/Tracker/src/back.jl:164 [inlined]
 [14] gradient(::Function, ::Array{Float64,4}) at /Users/tpevny/.julia/dev/Tracker/src/back.jl:164
 [15] top-level scope at none:0

Thanks for the help.

Issues with degenerate input for pooling layers

Our pooling layers don't deal well with degenerate inputs (e.g. ones()). MWE:

using Flux
x = ones(10, 10, 1, 1)
xp = param(x)
y_hat = MaxPool((3,3), stride=(2,2))(xp)
Flux.back!(sum(y_hat))

Which results in:

julia> xp.grad
10×10×1×1 Array{Float64,4}:
[:, :, 1, 1] =
 1.0  1.0  2.0  1.0  2.0  1.0  2.0  1.0  1.0  0.0
 1.0  1.0  2.0  1.0  2.0  1.0  2.0  1.0  1.0  0.0
 2.0  2.0  4.0  2.0  4.0  2.0  4.0  2.0  2.0  0.0
 1.0  1.0  2.0  1.0  2.0  1.0  2.0  1.0  1.0  0.0
 2.0  2.0  4.0  2.0  4.0  2.0  4.0  2.0  2.0  0.0
 1.0  1.0  2.0  1.0  2.0  1.0  2.0  1.0  1.0  0.0
 2.0  2.0  4.0  2.0  4.0  2.0  4.0  2.0  2.0  0.0
 1.0  1.0  2.0  1.0  2.0  1.0  2.0  1.0  1.0  0.0
 1.0  1.0  2.0  1.0  2.0  1.0  2.0  1.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

Compare with adding a bit of noise to eliminate the degeneracy:

xp = param(x .+ 0.01.*randn(size(x)...))
y_hat = MaxPool((3,3), stride=(2,2))(xp)
Flux.back!(sum(y_hat))

Which results in the proper output, satisfying prod(size(y_hat)) == sum(xp.grad):

julia> xp.grad
10×10×1×1 Array{Float64,4}:
[:, :, 1, 1] =
 0.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  2.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 0.0  2.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

conv with SubArray

Is there a reason to force x and w to have the same type in conv? I would like to use conv with x being a view of another array, i.e. a SubArray. One solution could be to remove the type constraint for x on conv, conv!, ∇conv_filter!, conv2d!, conv2d_grad_w!, ∇conv_filter, etc.

ambiguity in calling maxpool

julia> typeof(x)
CuArray{Float32,4}

julia> maxpool(x, PoolDims(x,3), pad = (1,1), stride = (2,2))
ERROR: MethodError: getfield(NNlib, Symbol("#kw##maxpool!"))()(::NamedTuple{(:pad, :stride),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64}}}, ::typeof(maxpool!), ::CuArray{Float32,4}, ::CuArray{Float32,4}, ::PoolDims{2,(3, 3),(3, 3),(0, 0, 0, 0),(1, 1)}) is ambiguous. Candidates:
  (::getfield(NNlib, Symbol("#kw##maxpool!")))(::Any, ::typeof(maxpool!), y::CuArray{T,N} where N, x::CuArray{T,N} where N, k) where T<:Union{Float16, Float32, Float64} in CuArrays.CUDNN
  (::getfield(NNlib, Symbol("#kw##maxpool!")))(::Any, ::typeof(maxpool!), y::AbstractArray{T,4}, x::AbstractArray{T,4}, pdims::PoolDims) where T in NNlib
Possible fix, define
  (::getfield(NNlib, Symbol("#kw##maxpool!")))(::Any, ::typeof(maxpool!), ::CuArray{T<:Union{Float16, Float32, Float64},4}, ::CuArray{T<:Union{Float16, Float32, Float64},4}, ::PoolDims)
Stacktrace:
 [1] macro expansion at C:\Users\Kristoffer\.julia\packages\NNlib\0FaRq\src\pooling.jl:115 [inlined]
 [2] #maxpool#172(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64},Tuple{Symbol,Symbol},NamedTuple{(:pad, :stride),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64}}}}, ::Function, ::CuArray{Float32,4}, ::PoolDims{2,(3, 3),(3, 3),(0, 0, 0, 0),(1, 1)}) at C:\Users\Kristoffer\.julia\packages\TimerOutputs\7zSea\src\TimerOutput.jl:190
 [3] (::getfield(NNlib, Symbol("#kw##maxpool")))(::NamedTuple{(:pad, :stride),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64}}}, ::typeof(maxpool), ::CuArray{Float32,4}, ::PoolDims{2,(3, 3),(3, 3),(0, 0, 0, 0),(1, 1)}) at .\none:0
 [4] top-level scope at none:0

Why are softplus and logsigmoid implemented independently?

It is easy to show that softplus(x) == -logσ(-x), so I'm curious about why these functions are defined independently in activation.jl. Is there a reason for this? Wouldn't it be easier to implement/maintain just one of them and use the above relation for the other?

(I did a quick benchmark on my laptop to see which one is faster. Although I couldn't determine a clear winner, the softplus version seems to be numerically more stable. E.g.,

julia> logσ(40)
-0.0

julia> -softplus(-40)
-4.248354255291589e-18

)

Add maxpool and meanpool methods

During the overhaul we lost maxpool(x, (2, 2)) and co. This was not ideal as they are commonly used by users and, I think, expose a good simple API for doing pooling. We should add these convenience methods back.
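A minimal sketch of the kind of convenience method this would restore, assuming the PoolDims constructor used elsewhere in these issues (the pad/stride defaults here are illustrative):

maxpool(x, k::NTuple{N,Int}; pad = 0, stride = k) where {N} =
    maxpool(x, PoolDims(x, k; padding = pad, stride = stride))

meanpool(x, k::NTuple{N,Int}; pad = 0, stride = k) where {N} =
    meanpool(x, PoolDims(x, k; padding = pad, stride = stride))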

∇conv_filter seems broken

I was validating my XLA implementation of the NNlib functions against the provided implementations and failed to do so for ∇conv_filter. Upon closer inspection, ∇conv_filter seems quite broken:

julia> f(dy, input, kernel) = NNlib.∇conv_filter(dy, input, kernel; pad=(0,0), stride=(1,1), dilation=(1,1))
f (generic function with 2 methods)

julia> f(Float32.(reshape(a, (2,2,1,1))), reshape(Float32[1.], (1,1,1,1)), Float32.(A))
2×2×1×1 Array{Float32,4}:
[:, :, 1, 1] =
 3.68136e-32  5.44964e-32
 3.68189e-32  4.54e-33

julia> f(Float32.(reshape(a, (2,2,1,1))), reshape(Float32[1.], (1,1,1,1)), Float32.(A))
2×2×1×1 Array{Float32,4}:
[:, :, 1, 1] =
 1.55684e-41  1.55684e-41
 1.55684e-41  1.55544e-42

(notice that the values change between calls, so something is reading uninitialized memory).

Integration with CuArrays: exp vs CUDAnative.exp

using CuArrays
using NNlib

A = cu(rand(Float32, 10, 100))
B = cu(rand(Float32, 10))
CUDAnative.exp.(NNlib.σ.(A .+ B))

gives:

warning: ignoring debug info with an invalid version (0) in #1
warning: ignoring debug info with an invalid version (0) in 
ERROR: LLVM error: All DICompileUnits must be listed in llvm.dbg.cu

Stacktrace:
 [1] verify(::LLVM.Module) at /home/slipslop/.julia/v0.6/LLVM/src/analysis.jl:11
 [2] #add_entry!#26(::Bool, ::Function, ::LLVM.Module, ::Any, ::Any) at /home/slipslop/.julia/v0.6/CUDAnative/src/jit.jl:251
 [3] (::CUDAnative.#kw##add_entry!)(::Array{Any,1}, ::CUDAnative.#add_entry!, ::LLVM.Module, ::Any, ::Any) at ./<missing>:0
 [4] #compile_function#51(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/slipslop/.julia/v0.6/CUDAnative/src/jit.jl:402
 [5] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/slipslop/.julia/v0.6/CUDAnative/src/jit.jl:465
 [6] macro expansion at /home/slipslop/.julia/v0.6/CUDAnative/src/execution.jl:108 [inlined]
 [7] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##1#2, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool,Bool},Tuple{Bool}}, ::Tuple{Tuple{Int64,Int64},Tuple{Int64}}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at /home/slipslop/.julia/v0.6/CUDAnative/src/execution.jl:80
 [8] _broadcast! at /home/slipslop/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
 [9] broadcast_t at /home/slipslop/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
 [10] broadcast_c at /home/slipslop/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
 [11] broadcast(::Function, ::CuArray{Float32,2}, ::CuArray{Float32,1}) at ./broadcast.jl:455

Most likely it's caused by the difference between Base.exp and CUDAnative.exp: σ uses the former, which doesn't work for CuArrays. Normally I resolve this by overloading a function (like this for CUDAnative.log), but in NNlib sigmoid is defined for Float32 and broadcast over whatever array type it's applied to.

Right now I can't think of a way to define σ (and similar functions) so that it works for both Array and CuArray; ideas are welcome.


I use Julia 0.6.2, CUDAnative 0.5.3 (the last one for 0.6.x Julia) and latest master of CuArrays.
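As a stopgap, one hedged workaround sketch is to broadcast a device-friendly sigmoid explicitly (the name cusigmoid is made up here; it just swaps Base.exp for CUDAnative.exp inside the definition):

cusigmoid(x) = one(x) / (one(x) + CUDAnative.exp(-x))
CUDAnative.exp.(cusigmoid.(A .+ B))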

Bad gradient for pool with padding

Unless I am doing something wrong:

using NNlib

x = reshape([(1.:9.)...], 3, 3, 1, 1)
pdims = PoolDims(size(x), (2,2), padding = (1,1), stride = (2,2))
y = maxpool(x, pdims)
dy = y .* 0 .+ 1
dx = ∇maxpool(dy, y, x, pdims)

display(x)
display(y)
display(dy)
display(dx)

gives

3×3×1×1 Array{Float64,4}:                                                                                                     
[:, :, 1, 1] =                                                                                                                
 1.0  4.0  7.0                                                                                                                
 2.0  5.0  8.0                                                                                                                
 3.0  6.0  9.0                                                                                                                
2×2×1×1 Array{Float64,4}:                                                                                                     
[:, :, 1, 1] =                                                                                                                
 1.0  7.0                                                                                                                     
 3.0  9.0                                                                                                                     
2×2×1×1 Array{Float64,4}:                                                                                                     
[:, :, 1, 1] =                                                                                                                
 1.0  1.0                                                                                                                     
 1.0  1.0                                                                                                                     
3×3×1×1 Array{Float64,4}:                                                                                                     
[:, :, 1, 1] =                                                                                                                
 1.0  0.0  0.0                                                                                                                
 0.0  0.0  0.0                                                                                                                
 0.0  0.0  1.0                                                                                                                

The gradients for 3 and 7 should not be 0.

Nd softmax

In NNlib we have 1D/2D softmax, and in CUDNN we have 4D softmax, so essentially we don't have one that works for both CPU and GPU.

More generally, we may want to provide a softmax implementation with a configurable dimension (e.g. like in PyTorch).

Fill in package description

This will say "No description, website, or topics provided." on pkg.julialang.org until you do. Same on DataFlow.

Deprecation warnings.

Deprecation warning while using relu.

julia> using Flux

julia> relu(rand(4,4))
WARNING: max(x::AbstractArray{T1}, y::AbstractArray{T2}) where {T1 <: Real, T2 <: Real} is deprecated, use max.(x, y) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] max(::Array{Float64,2}, ::Array{Float64,2}) at ./deprecated.jl:57
 [3] relu(::Array{Float64,2}) at /home/ayush99/.julia/v0.6/NNlib/src/activation.jl:48
 [4] eval(::Module, ::Any) at ./boot.jl:235
 [5] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [6] macro expansion at ./REPL.jl:97 [inlined]
 [7] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
4×4 Array{Float64,2}:
 0.283668  0.0684376  0.83191   0.41665 
 0.617301  0.332963   0.448118  0.328916
 0.769337  0.872009   0.279832  0.931938
 0.701046  0.345068   0.288273  0.185077

Not exactly an issue, but it'd be better to remove such warnings.

Docstrings & tests for the new convolution interface

I'm trying to understand the meaning and usage of this signature in the new convolution interface:

conv!(y::AbstractArray{T,3}, x::AbstractArray{T,3}, w::AbstractArray{T,3})

w here should have at least 4 dimensions:

  1. filter height
  2. filter width
  3. number of input channels
  4. number of output channels

So having a 3D w doesn't make sense to me.

Also, for x and y we seem to add an additional dimension at position 2. I guess the intent was to add a dummy batch or channel dimension, which would be true for the C API (i.e. in the NCHW scheme). However, in the Julia API we normally put the spatial dimensions first, i.e. the HWCN scheme (or WHCN, I always forget which one). So adding a new dimension at position 2 means setting the image width to 1.

Am I misunderstanding something?

All in all, I have doubts about the automatic addition of missing dimensions and about having a single conv! dispatching on array dimensions: with conv2d + 4D tensors and conv3d + 5D tensors it's absolutely clear what each dimension means. However, a 3D tensor in conv! may be interpreted as a missing channel dimension (multiple grayscale images) or a missing batch dimension (a single color image). Similarly, conv! on a 4D tensor may be interpreted as conv2d! or as conv3d with one missing dimension, and so on.
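For comparison, the 4D case is unambiguous; using the DenseConvDims constructor that appears elsewhere in these issues (sizes are illustrative):

x = rand(Float32, 28, 28, 3, 8)    # width, height, channels, batch
w = rand(Float32, 5, 5, 3, 16)     # kernel width, kernel height, input channels, output channels
y = conv(x, w, DenseConvDims(x, w))
size(y)                            # (24, 24, 16, 8)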

Softmax and crossentropy are prone to produce NaNs

Hi,

I have found examples where softmax or logitcrossentropy overflows. I will try to find the problem, but I will file the issue first.

This overflows logitcrossentropy.

o = param(reshape([ -431.0;  279.0;   427.0], 3, 1))
y = [true, false, false]
f = Flux.logitcrossentropy(o, y);
Flux.Tracker.back!(f); 
Flux.Tracker.grad(o)

I believe the problem is this line in logsoftmax.jl:26: ∇logsoftmax(Δ, xs) = ∇softmax(Δ ./ softmax(xs), xs).

A second example just overflows the softmax as

y = [true, false, false]
o = param(reshape([ -1047, -981, 1891],3,1))
Flux.Tracker.back!(sum(softmax(o))); 
Flux.Tracker.grad(o)

The problem here is that softmax(o) outputs [0.0, 0.0, 1.0], and that ∇softmax uses exp without checking that it might overflow.
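For reference, a hedged sketch of a more direct gradient that avoids dividing by softmax(xs) (this is the standard identity for the log-softmax pullback, assuming column-wise reduction; not necessarily the package's eventual fix):

∇logsoftmax(Δ, xs) = Δ .- softmax(xs) .* sum(Δ; dims = 1)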

Nested AD not defined for conv; any plan?

Currently, it seems that convolutions do not support higher-order derivatives (which are required for, e.g., WGAN-GP):

julia> using Flux, Tracker

julia> let c = Conv((3, 3), 1 => 1)
           Tracker.gradient(() -> sum(Tracker.jacobian(x -> vec(c(reshape(x, 3, 3, 1, 1))), ones(9)) .^ 2), params(c))
       end
ERROR: Nested AD not defined for conv

(The error is from Tracker.jl but it looks like the actual implementation for convolutions and its derivatives are done in NNlib.jl. So that's why I'm posting here.)

Is there a plan to support this in Flux ecosystem? Would it happen in NNlib if that's the case? Also, I guess it's mostly non-interacting with Tracker-to-Zygote transition (as both of them would call primitives defined in NNlib)?

softmax return result not accurate

using NNlib
softmax1(x) = exp.(x) ./ sum(exp.(x))
y1 = [0.0002, 0.2, 0.9,0.0001,0.4,0.6]
println("Stablized vector ",softmax(y1))
println("sum of vector ",sum(softmax(y1)))
println("sum of vector ",sum(softmax1(y1)))
Stablized vector [0.111192, 0.135783, 0.273434, 0.111181, 0.165846, 0.202565]
sum of vector 1.0000000000000002
sum of vector 1.0

Requesting `log_softmax`, `log_sigmoid`, `log_sum_exp`

The output of softmax will be 0 when the value is very close to 0, especially when it is represented by Float32 in a CuArray, and this then leads to an infinite loss if we use the output to compute the cross-entropy loss. It would be more numerically stable if NNlib could provide a log_softmax function.
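A minimal numerically stable sketch of the requested function, shifting by the maximum before exponentiating (illustrative only, not an existing NNlib API):

function log_softmax(x::AbstractArray; dims = 1)
    m = maximum(x; dims = dims)
    x .- m .- log.(sum(exp.(x .- m); dims = dims))
end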

Overflow of Float64 logistic sigmoid derivative

I've encountered NaN derivatives when using the logistic sigmoid (σ) for moderate sized negative numbers. I think the boundary is about x=-709.783:

using Flux: Tracker
using ForwardDiff, NNlib

# Within Flux
x = Tracker.param(-709.783)
Tracker.back!(σ(x))
Tracker.grad(x) # = NaN

# Overflow in Dual Number:
σ(ForwardDiff.Dual(-709.783,1.0))  # Dual{Nothing}(0.0,NaN)

since (1 + exp(709.782)) ≈ 1.8 x 10^308. Given that there is a σ_stable function defined presumably for precisely this issue when using Float32, should this be extended to all real numbers?

EDIT: A similar problem occurs in softplus at the same (negated) value, which may benefit from being set to the identity above, e.g., x = 80. (The other activation functions look OK.)
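A minimal sketch of a branch-based stable form that never evaluates exp of a large positive argument (not necessarily how σ_stable is actually implemented):

function stable_σ(x::Real)
    t = exp(-abs(x))            # always in (0, 1], never overflows
    ifelse(x ≥ 0, inv(one(t) + t), t / (one(t) + t))
end

Both branches stay finite, so dual-number differentiation no longer produces NaN at x = -709.783.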

Erratic Dilation behavior

Uh-oh, looks like I should have tested this more thoroughly. :/

# 2-length convolution with 1 channel in and 1 channel out
w = reshape(Float64[1, 1], (2, 1, 1))

# 128-length input vector with 1 channel and 1 batch.
x = reshape(Float64[1:128...], (128, 1, 1))

# Let's calculate the same thing a couple different times
ys = []
for idx in 1:10
    push!(ys, NNlib.conv(x, w; dilation = 2))
end

# Do some of these randomly have a bunch of NaN's in them?  Why yes they do:
[any(isnan.(ys[idx])) for idx in 1:10]

Running the above results in:

10-element Array{Bool,1}:
 false
  true
  true
 false
 false
 false
 false
 false
  true
  true

The fact that this is erratic makes me think the BLAS routine is reading from uninitialized/clobbered memory.

Convolutions with negative padding segfault... sometimes

Sometimes a negative padding fails (only for Float32?), sometimes it doesn't...

This works:

using Flux
model = Conv((2,2), 1=>1 , pad=(-1,-1))
x32 = rand(Float32, 10, 10, 1, 1)
x64 = rand(Float64, 10, 10, 1, 1)
model(x32)
model(x64)

and this works:

x64 = rand(Float64, 20, 20, 1, 1)
model(x64)

but this fails (segfaults):

x64 = rand(Float32, 20, 20, 1, 1)
model(x64)

with the following error message:

signal (11): Segmentation fault
in expression starting at no file:0
sgemm_itcopy_HASWELL at /home/troels/packages/julias/julia-1.1.1/bin/../lib/julia/libopenblas64_.so (unknown line)
sgemm_nn at /home/troels/packages/julias/julia-1.1.1/bin/../lib/julia/libopenblas64_.so (unknown line)
sgemm_64_ at /home/troels/packages/julias/julia-1.1.1/bin/../lib/julia/libopenblas64_.so (unknown line)
gemm! at /home/troels/.julia/packages/NNlib/mxWRT/src/gemm.jl:49 [inlined]
macro expansion at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:230 [inlined]
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/impl/conv_im2col.jl:57 [inlined]
macro expansion at ./gcutils.jl:87 [inlined]
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/impl/conv_im2col.jl:53 [inlined]
#conv_im2col!#231 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190
unknown function (ip: 0x7f5b42ed3f2a)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
conv_im2col! at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198 [inlined]
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:51 [inlined]
#conv!#37 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190 [inlined]
conv! at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198
unknown function (ip: 0x7f5b42ed2462)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
#conv!#56 at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:68
unknown function (ip: 0x7f5b42ed19e6)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
conv! at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:68
unknown function (ip: 0x7f5b42ed1332)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:114 [inlined]
#conv#97 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190
unknown function (ip: 0x7f5b42ed04c2)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
#_forward#524 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198 [inlined]
_forward at ./none:0
unknown function (ip: 0x7f5b42ed021d)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1571 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:556
#track#1 at /home/troels/.julia/packages/Tracker/RRYy6/src/Tracker.jl:51
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
track at /home/troels/.julia/packages/Tracker/RRYy6/src/Tracker.jl:51 [inlined]
#conv#522 at /home/troels/.julia/packages/Tracker/RRYy6/src/lib/array.jl:419 [inlined]
conv at /home/troels/.julia/packages/Tracker/RRYy6/src/lib/array.jl:419 [inlined]
Conv at /home/troels/.julia/packages/Flux/qXNjB/src/layers/conv.jl:55
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:323
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:411
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:362 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:773
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:885
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7f5b5c91a94f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:894
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:764
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:793
eval at ./boot.jl:328
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85
run_backend at /home/troels/.julia/packages/Revise/agmgx/src/Revise.jl:949
#75 at ./task.jl:259
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1571 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:572
unknown function (ip: 0xffffffffffffffff)
Allocations: 40188553 (Pool: 40182088; Big: 6465); GC: 81
Segmentation fault (core dumped)

Depending on details of the kernel size, the size of the input array, and whether the padding is symmetric or not, I get different error messages/crashes (double free or corruption (!prev), corrupted size vs. prev_size, malloc(): smallbin double linked list corrupted, malloc_consolidate(): invalid chunk size, ...). Typically, the convolution crashes once the input arrays become larger than some small size. So far, the issue only seems to affect Float32 inputs.

I'm on Julia 1.1.1, Flux 0.8.3, NNlib 0.6.0 (and Ubuntu 19.04).

Integration with Batched.jl

Discussed with @MikeInnes on slack a little bit. I implemented some batched operations and its dependent types here: https://github.com/Roger-luo/Batched.jl

Since we can overload operators for BatchedArray, this won't need any extra code for the AD (as long as you input your data / define the variable as a BatchedArray). So I was thinking of adding it directly as a dependency of Flux.

Or would it be preferable to have this in NNlib?

PS. Currently, batched_gemm in Batched is 2x faster than PyTorch, and batched_tr is 363x faster than a naive Python loop with torch.trace in PyTorch (they don't have a batched trace).

cc: @MikeInnes

Fast Convolutions and Performance in NNlib

The convolutions provided by the FastConv package, described in their paper, considerably outperform the NNlib back ends for 1D and 2D convolutions, at least on CPU.

using FastConv
using NNlib
using BenchmarkTools

x = randn(500,500,1,1)
spatial_dims = (5,5)
k = randn(spatial_dims...,1,1)

cdims = DenseConvDims(x,k; padding= spatial_dims .-1)

fast_y = @btime convn(x,k);
# 9.582 ms (8 allocations: 1.94 MiB)

nnlib_y = @btime conv(x,k,cdims);
#244.020 ms (33 allocations: 5.80 MiB) 

nnlib_im2col_y = @btime NNlib.conv_im2col(x,k,cdims);
#10.453 ms (50 allocations: 50.39 MiB)

isapprox(fast_y,nnlib_y,atol=1e-3)
#true

Convolution for mixed-precision inputs

Currently it's easy to get errors from passing (say) an f32 filter and an f64 input to a convolution. We should make this a bit more liberal, either by loosening any unnecessary type restrictions (some of the implementations are pure Julia, so this should work) or by just promoting the arrays to a common type.
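A hedged sketch of the promotion option (the name conv_promote is made up; it just illustrates converting both arrays to a common element type before dispatching):

function conv_promote(x::AbstractArray{Tx}, w::AbstractArray{Tw}, cdims; kw...) where {Tx, Tw}
    T = promote_type(Tx, Tw)
    conv(T.(x), T.(w), cdims; kw...)
end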
