BetaML.jl's Issues

Corner case for KernelPerceptronClassifier: unique target class

If only one class is seen in the training data, the model fits okay, but prediction fails. I wonder if this is something that could be supported. I encountered this issue when doing cross-validation for a very small binary classification problem (crabs).

using MLJ

Model = @load KernelPerceptronClassifier

model = Model()

X = (x=rand(10), );

y = coerce(collect("aaaaaaaaaab"), Multiclass)[1:10];

julia> unique(y)
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> levels(y)
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

# works fine:
mach = machine(model, X, y) |> fit!;

# problem:
julia> predict_mode(mach, X)
ERROR: BoundsError: attempt to access 0-element Vector{Matrix{Float64}} at index [1]
Stacktrace:
 [1] getindex
   @ ./array.jl:861 [inlined]
 [2] predict(x::Matrix{Float64}, xtrain::Vector{Matrix{Float64}}, ytrain::Vector{Vector{Int64}}, α::Vector{Vector{Int64}}, classes::Vector{Char}; K::typeof(BetaML.Utils.radialKernel))
   @ BetaML.Perceptron ~/.julia/packages/BetaML/AeLyL/src/Perceptron/Perceptron.jl:622
 [3] predict(model::BetaML.Perceptron.KernelPerceptronClassifier, fitresult::Tuple{NamedTuple{(:x, :y, :α, :classes, :K), Tuple{Vector{Matrix{Float64}}, Vector{Vector{Int64}}, Vector{Vector{Int64}}, Vector{Char}, typeof(BetaML.Utils.radialKernel)}}, Vector{Char}}, Xnew::NamedTuple{(:x,), Tuple{Vector{Float64}}})
   @ BetaML.Perceptron ~/.julia/packages/BetaML/AeLyL/src/Perceptron/Perceptron_MLJ.jl:137
 [4] predict_mode(m::BetaML.Perceptron.KernelPerceptronClassifier, fitresult::Tuple{NamedTuple{(:x, :y, :α, :classes, :K), Tuple{Vector{Matrix{Float64}}, Vector{Vector{Int64}}, Vector{Vector{Int64}}, Vector{Char}, typeof(BetaML.Utils.radialKernel)}}, Vector{Char}}, Xnew::NamedTuple{(:x,), Tuple{Vector{Float64}}})
   @ MLJBase ~/MLJ/MLJBase/src/interface/model_api.jl:11
 [5] predict_mode(mach::Machine{BetaML.Perceptron.KernelPerceptronClassifier, true}, Xraw::NamedTuple{(:x,), Tuple{Vector{Float64}}})                                                  
   @ MLJBase ~/MLJ/MLJBase/src/operations.jl:85
 [6] top-level scope
   @ REPL[39]:1

Improve oneHotEncoder stability when encoding integer-coded categories

julia> oneHotEncoder([-1,1,1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [-1]
Stacktrace:
 [1] setindex!
   @ ./array.jl:903 [inlined]
 [2] oneHotEncoderRow(x::Int64; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:64
 [3] oneHotEncoder(Y::Vector{Int64}; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:127
 [4] oneHotEncoder(Y::Vector{Int64})
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:121
 [5] top-level scope
   @ REPL[5]:1

julia> oneHotEncoder([-1,1,1],factors=[-1,1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [-1]
Stacktrace:
 [1] setindex!
   @ ./array.jl:903 [inlined]
 [2] oneHotEncoderRow(x::Int64; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:64
 [3] oneHotEncoder(Y::Vector{Int64}; d::Int64, factors::Vector{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:127
 [4] top-level scope
   @ REPL[6]:1
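Until arbitrary integer categories are supported, one workaround is to remap the raw labels to 1:K before encoding. A minimal sketch (the variable names are mine):

using BetaML

raw    = [-1, 1, 1]
levs   = sort(unique(raw))                      # [-1, 1]
idx    = [findfirst(==(v), levs) for v in raw]  # [1, 2, 2]
onehot = oneHotEncoder(idx)                     # categories are now 1:K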

Rename Nn to NN

Since you were renaming the package only two days ago: should NN also be used (more appropriate as an acronym, like RBF)?

I'm not sure it's advisable to have both (a new module re-exporting the old). If you rename it (or even if you don't), could something like:

module Nn
__init__() = error("do use: using NN")
end

work, or vice versa?

Allow `verbosity` to be any integer?

I was surprised, when re-running some ecosystem-wide integration tests, to get this message when training these models using the MLJ interface: MultitargetNeuralNetworkRegressor and NeuralNetworkRegressor:

Wrong verbosity level. Verbosity must be either 0, 10, 20, 30 or 40

I was probably using verbosity = -1 to suppress warnings.

I understand the MLJ spec is mostly silent on this, but in practice the rule has been: "With the exception of warnings, training should be silent if verbosity == 0. Lower values should suppress warnings", and I would add "any integer should be allowed".

Perhaps in the MLJ interface for the BetaML models one could map

<= 0 -> 0
1 -> 10
2 -> 20
3 -> 30
>= 5 -> 40

or something similar?
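For concreteness, a clamping translation layer along those lines could look like this (a sketch; the helper name is hypothetical, and the treatment of 4 is a guess, since the mapping above leaves it unspecified):

# map any MLJ verbosity integer onto BetaML's discrete levels
function mlj_to_betaml_verbosity(v::Integer)
    v <= 0 && return 0
    return 10 * min(v, 4)   # 1 -> 10, 2 -> 20, 3 -> 30, 4 and above -> 40
end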

MLJ interface for `KernelPerceptronClassifier` is not tracking all target levels

julia> using MLJ

julia> Model = @load KernelPerceptronClassifier
[ Info: For silent loading, specify `verbosity=0`. 
import BetaML ✔
BetaML.Perceptron.KernelPerceptronClassifier

julia> model = Model()
KernelPerceptronClassifier(
  K = BetaML.Utils.radialKernel, 
  maxEpochs = 100, 
  initialα = Int64[], 
  shuffle = false, 
  rng = Random._GLOBAL_RNG())

julia> X = (x=rand(10), );

julia> y = coerce(collect("abababababcc"), Multiclass)[1:10];

julia> unique(y)
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

julia> levels(y)
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> mach = machine(model, X, y) |> fit!;
[ Info: Training machine(KernelPerceptronClassifier(K = radialKernel, ), ).

julia> predict_mode(mach, X) |> levels
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

That last result indicates a bug, as all levels in the pool of the training vector should be present in the pool of the predictions.

Curiously, in the other classifiers I looked at, the levels are indeed being tracked correctly. So perhaps have a look at, e.g., the BetaML DecisionTreeClassifier to see how this can be corrected.

This bug is causing a failure when the model is bagged in an ensemble using EnsembleModel because some classes are not present in some of the bagged observations, but are present in others.
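For what it's worth, tying predictions to the full training pool is exactly what the UnivariateFinite constructor supports via its pool keyword (see the docstring quoted in a later issue). A runnable sketch of the idea:

using MLJBase, CategoricalArrays

# toy target whose pool declares a level 'c' that is never observed
y_train = categorical(['a', 'a', 'b'], levels=['a', 'b', 'c'])
probs   = [0.7 0.3; 0.2 0.8; 0.5 0.5]   # one row per observation

# build predictions over the observed support, but in the full pool
preds = UnivariateFinite(['a', 'b'], probs, pool=y_train)

pdf(preds[1], 'c')   # 0.0 -- 'c' is tracked, with zero probability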

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Error generating MLJ model registry

Running MLJModels.@update to update MLJ's model registry runs into this new error:

ERROR: LoadError: Bad `load_path` trait for BetaML.Imputation.BetaMLGMMImputer: BetaMLGMMImputer not a registered package. 
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ ~/MLJ/MLJModels/src/registry/src/update.jl:122
 [3] eval
   @ ./boot.jl:373 [inlined]
 [4] eval(x::Expr)
   @ Base.MainInclude ./client.jl:453
 [5] _update(mod::Module, test_env_only::Bool)
   @ MLJModels.Registry ~/MLJ/MLJModels/src/registry/src/update.jl:153
 [6] var"@update"(__source__::LineNumberNode, __module__::Module)
   @ MLJModels.Registry ~/MLJ/MLJModels/src/registry/src/update.jl:24
in expression starting at REPL[4]:1

Error during precompilation (ERROR: LoadError: InitError: Evaluation into the closed module `Perceptron` ...)

I wanted to use BetaML.jl in a project; however, when I try to do so, I get the following error:

julia> using Foo
[ Info: Precompiling Foo [4817f03b-69bd-4595-9d0a-a711fd8a192f]
ERROR: LoadError: InitError: Evaluation into the closed module `Perceptron` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `Perceptron` with `eval` during precompilation - don't do this.
Stacktrace:
  [1] eval
    @ ./boot.jl:368 [inlined]
  [2] eval(x::Expr)
    @ BetaML.Perceptron ~/.julia/packages/BetaML/mqBvh/src/Perceptron/Perceptron.jl:19
  [3] metadata_pkg(T::Type; name::String, uuid::String, url::String, julia::Bool, license::String, is_wrapper::Bool, package_name::String, package_uuid::String, package_url::String, is_pure_julia::Bool, package_license::String)
    @ MLJModelInterface ~/.julia/packages/MLJModelInterface/wwFA9/src/metadata_utils.jl:54
  [4] #41
    @ ./broadcast.jl:1284 [inlined]
  [5] _broadcast_getindex_evalf
    @ ./broadcast.jl:670 [inlined]
  [6] _broadcast_getindex
    @ ./broadcast.jl:643 [inlined]
  [7] #29
    @ ./broadcast.jl:1075 [inlined]
  [8] macro expansion
    @ ./ntuple.jl:74 [inlined]
  [9] ntuple
    @ ./ntuple.jl:69 [inlined]
 [10] copy
    @ ./broadcast.jl:1075 [inlined]
 [11] materialize
    @ ./broadcast.jl:860 [inlined]
 [12] __init__()
    @ BetaML ~/.julia/packages/BetaML/mqBvh/src/BetaML.jl:63
 [13] _include_from_serialized(pkg::Base.PkgId, path::String, depmods::Vector{Any})
    @ Base ./loading.jl:831
 [14] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt64)
    @ Base ./loading.jl:1039
 [15] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1315
 [16] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
 [17] macro expansion
    @ ./loading.jl:1180 [inlined]
 [18] macro expansion
    @ ./lock.jl:223 [inlined]
 [19] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144
 [20] include
    @ ./Base.jl:419 [inlined]
 [21] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1554
 [22] top-level scope
    @ stdin:1
during initialization of module BetaML
in expression starting at /data_temp/picaud/Temp/Beta/Foo.jl/src/Foo.jl:1
in expression starting at stdin:1
ERROR: Failed to precompile Foo [4817f03b-69bd-4595-9d0a-a711fd8a192f] to /home/picaud/.julia/compiled/v1.8/Foo/jl_a1tr7Z.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
   @ Base ./loading.jl:1707
 [3] compilecache
   @ ./loading.jl:1651 [inlined]
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1337
 [5] _require_prelocked(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1200
 [6] macro expansion
   @ ./loading.jl:1180 [inlined]
 [7] macro expansion
   @ ./lock.jl:223 [inlined]
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1144

The error is not present when I remove precompilation; the "patch" to BetaML.jl is to comment out this function:

# function __init__()
#     MMI.metadata_pkg.(MLJ_INTERFACED_MODELS,
#         name       = "BetaML",
#         uuid       = "024491cd-cc6b-443e-8034-08ea7eb7db2b",     # see your Project.toml
#         url        = "https://github.com/sylvaticus/BetaML.jl",  # URL to your package repo
#         julia      = true,     # is it written entirely in Julia?
#         license    = "MIT",    # your package license
#         is_wrapper = false,    # does it wrap around some other package?
#     )
# end

Steps to reproduce:

Create a local package Foo (in /tmp/, for example):

cd /tmp

(@v1.8) pkg> generate Foo.jl
  Generating  project Foo:
    Foo.jl/Project.toml
    Foo.jl/src/Foo.jl

(@v1.8) pkg> activate ./Foo.jl/
  Activating project at `/tmp/Foo.jl`

(Foo) pkg> add BetaML

(Foo) pkg> activate 
  Activating project at `~/.julia/environments/v1.8`

(@v1.8) pkg> dev ./Foo.jl/
   Resolving package versions...

Then modify Foo.jl as follows:

module Foo

using BetaML # <---- here

greet() = print("Hello World!")

end # module Foo

Then, from Julia, type

julia> using Foo

and I (and maybe you) will get the error I mentioned at the beginning.


Thanks!

MLJ traits for GMMClusterer

The GMMClusterer is an unsupervised probabilistic model. However, we can't check that programmatically because of JuliaAI/MLJModelInterface.jl#120.

Is there any fix to make sure that both KMeans and GMMClusterer return a set of categorical values? Right now predict(KMeans(), ...) will return a vector of categorical values, whereas predict(GMMClusterer(), ...) will return a vector of distributions.
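In the meantime, a user-side way to get categorical labels out of the probabilistic model is to collapse the distributions with MLJ's predict_mode. A sketch, assuming predict_mode applies to this clusterer's probabilistic predictions:

using MLJ

GMM = @load GMMClusterer pkg=BetaML verbosity=0
X, _ = make_blobs(100, 3)
mach = machine(GMM(), X) |> fit!
labels = predict_mode(mach, X)   # categorical labels, as KMeans returns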

MLJ interface: fit should not mutate model fields

In MLJ learned parameters are distinct from hyper-parameters. A "model" in MLJ is a container for hyper-parameters and that is all.

For this reason, MMI.fit should not mutate model fields, and the original API forbade this (unfortunately, this rule seems to have disappeared from the docs: JuliaAI/MLJ.jl#755). Only clean! can mutate the fields, and only if they don't make sense. One exception is that fit may mutate an RNG.

So this is currently non-compliant:

using Pkg
Pkg.activate(temp=true)
Pkg.add("MLJBase")
Pkg.add(name="BetaML", rev="master")

using MLJBase
import BetaML

model = BetaML.Clustering.MissingImputator()
mixtures = deepcopy(model.mixtures)

X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38;
     missing -2.3; 5.2 -2.4] |> MLJBase.table

mach = machine(model, X) |> fit!

julia> @assert model.mixtures == mixtures
ERROR: AssertionError: model.mixtures == mixtures
Stacktrace:
    [1] top-level scope at REPL[40]:1

Maybe MMI.fit can begin by creating a deepcopy of mixtures and p₀, in this and the related models.
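A minimal sketch of that suggestion (the internal trainer name is hypothetical; only the deepcopy pattern is the point):

import MLJModelInterface as MMI
import BetaML

function MMI.fit(model::BetaML.Clustering.MissingImputator, verbosity, X)
    mixtures = deepcopy(model.mixtures)   # work on copies, so the
    p₀       = deepcopy(model.p₀)         # hyper-parameters stay untouched
    fitresult = _impute_fit(X, mixtures, p₀, verbosity)  # hypothetical internal trainer
    return fitresult, nothing, NamedTuple()
end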

Separate into subpackages?

Specifically, I think separating the modules here into subpackages (i.e., re-exported as part of a larger overall BetaML package) would help a lot with discoverability; for instance, with the problem I mentioned earlier of people in the stats community having lots of trouble finding the imputation methods here.

The input scitypes for trees are incorrect

From here:

    input_scitype    = MMI.Table(MMI.Missing, MMI.Known),           # also ok: MMI.Table(Union{MMI.Missing, MMI.Known}),

What is written in the comment is correct. What is actually used is not:

julia> X = (; x=[missing, 1, 2])
(x = Union{Missing, Int64}[missing, 1, 2],)

julia> scitype(X) <: Table(Missing, Known)
false

julia> scitype(X) <: Table(Union{Missing, Known})
true

For more on the Table scitype constructor, see here.

All the tree scitypes need changing.
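In trait-declaration form, the fix already spelled out by the comment in the source would read:

input_scitype = MMI.Table(Union{MMI.Missing, MMI.Known})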

Sorry that I did not pick this up in my review.

Avoid observation-by-observation construction of UnivariateFinite objects in MLJ interface

The overhead for constructing UnivariateFinite objects one at a time is very high. For this reason a UnivariateFiniteArray implementation of AbstractArray{<:UnivariateFinite} was developed. This includes optimised implementations of broadcasting pdf, and so forth.

I recommend that in the BetaML classifiers one construct probabilistic predictions by applying the UnivariateFinite(...) constructor (which can construct arrays as well as singletons) to the full matrix of probabilities (with all observations in it). You can see examples of this in all the MLJ probabilistic classifier interfaces. I am copying the doc-string for this constructor below:

cc @OkonSamuel


UnivariateFinite(support,
                 probs;
                 pool=nothing,
                 augmented=false,
                 ordered=false)

Construct a discrete univariate distribution whose finite support is
the elements of the vector support, and whose corresponding
probabilities are elements of the vector probs. Alternatively,
construct an abstract array of UnivariateFinite distributions by
choosing probs to be an array of one higher dimension than the array
generated.

Unless pool is specified, support should have type
AbstractVector{<:CategoricalValue} and all elements are assumed to
share the same categorical pool, which may be larger than support.

Important. All levels of the common pool have associated
probabilities, not just those in the specified support. However,
these probabilities are always zero (see example below).

If probs is a matrix, it should have a column for each class in
support (or one less, if augment=true). More generally, probs
will be an array whose size is of the form (n1, n2, ..., nk, c),
where c = length(support) (or one less, if augment=true) and the
constructor then returns an array of size (n1, n2, ..., nk).

using CategoricalArrays
v = categorical([:x, :x, :y, :x, :z])

julia> UnivariateFinite(classes(v), [0.2, 0.3, 0.5])
UnivariateFinite{Multiclass{3}}(x=>0.2, y=>0.3, z=>0.5)

julia> d = UnivariateFinite([v[1], v[end]], [0.1, 0.9])
UnivariateFinite{Multiclass{3}}(x=>0.1, z=>0.9)

julia> rand(d, 3)
3-element Array{Any,1}:
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z

julia> levels(d)
3-element Array{Symbol,1}:
 :x
 :y
 :z

julia> pdf(d, :y)
0.0

Specifying a pool

Alternatively, support may be a list of raw (non-categorical)
elements if pool is:

  • some CategoricalArray, CategoricalValue or CategoricalPool,
    such that support is a subset of levels(pool)

  • missing, in which case a new categorical pool is created which has
    support as its only levels.

In the last case, specify ordered=true if the pool is to be
considered ordered.

julia> UnivariateFinite([:x, :z], [0.1, 0.9], pool=missing, ordered=true)
UnivariateFinite{OrderedFactor{2}}(x=>0.1, z=>0.9)

julia> d = UnivariateFinite([:x, :z], [0.1, 0.9], pool=v) # v defined above
UnivariateFinite(x=>0.1, z=>0.9) (Multiclass{3} samples)

julia> pdf(d, :y) # allowed as `:y in levels(v)`
0.0

v = categorical([:x, :x, :y, :x, :z, :w])
probs = rand(100, 3)
probs = probs ./ sum(probs, dims=2)
julia> UnivariateFinite([:x, :y, :z], probs, pool=v)
100-element UnivariateFiniteVector{Multiclass{4},Symbol,UInt32,Float64}:
 UnivariateFinite{Multiclass{4}}(x=>0.194, y=>0.3, z=>0.505)
 UnivariateFinite{Multiclass{4}}(x=>0.727, y=>0.234, z=>0.0391)
 UnivariateFinite{Multiclass{4}}(x=>0.674, y=>0.00535, z=>0.321)
   ⋮
 UnivariateFinite{Multiclass{4}}(x=>0.292, y=>0.339, z=>0.369)

Probability augmentation

Unless augment=true, sums of elements along the last axis (row-sums in the case of a matrix) must be equal to one; otherwise, such an array is created by inserting appropriate elements ahead of those provided. This means the provided probabilities are associated with the classes c2, c3, ..., cn.


UnivariateFinite(prob_given_class; pool=nothing, ordered=false)

Construct a discrete univariate distribution whose finite support is
the set of keys of the provided dictionary, prob_given_class, and
whose values specify the corresponding probabilities.

The type requirements on the keys of the dictionary are the same as
the elements of support given above with this exception: if
non-categorical elements (raw labels) are used as keys, then
pool=... must be specified and cannot be missing.

If the values (probabilities) are arrays instead of scalars, then an
abstract array of UnivariateFinite elements is created, with the
same size as the array.

Rename/Alias `GeneralImputer` to `MICE`

The algorithm listed as GeneralImputer here is more widely known as MICE (multiple imputation by chained equations) in statistics. I'm not sure if the name used here is standard in ML, but the lack of a solid MICE implementation is a common complaint in the Julia statistics ecosystem, so I was very surprised to stumble across this pure-Julia implementation of MICE under a completely different name. Would it make sense to either rename or alias GeneralImputer to make this easier to discover?

Problem with MLJ interface for KMedoidsClusterer

import BetaML
using MLJTestInterface
using Test

@testset "generic mlj interface test" begin
    f, s = MLJTestInterface.test(
        [BetaML.Bmlj.KMeansClusterer,],
        MLJTestInterface.make_regression()[1];
        mod=@__MODULE__,
        verbosity=0, # bump to debug
        throw=true,  # set to `true` to debug (`false` in CI)
    )
    @test isempty(f) # `f` holds the failures returned by `test`
end

# generic mlj interface test: Error During Test at REPL[11]:1
#   Got exception outside of a @test
#   UndefVarError: `fitresults` not defined
#   Stacktrace:
#     [1] attempt(f::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}}, message::String; throw::Bool)

< parts omitted for clarity >
  
#   caused by: UndefVarError: `fitresults` not defined
#   Stacktrace:
#     [1] fitted_params(model::BetaML.Bmlj.KMeansClusterer, fitresult::@NamedTuple{classes::Vector{Int64}, centers::Matrix{Float64}, distanceFunction::BetaML.Bmlj.var"#13#15"})
#       @ BetaML.Bmlj ~/.julia/packages/BetaML/SPPMQ/src/Bmlj/Clustering_mlj.jl:175
#     [2] fitted_params(mach::MLJBase.Machine{BetaML.Bmlj.KMeansClusterer, true})
#       @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:820
#     [3] (::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}})()
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:85
#     [4] attempt(f::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}}, message::String; throw::Bool)
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:15
#     [5] #fitted_machine#8
#       @ ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:77 [inlined]
#     [6] fitted_machine
#       @ ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:75 [inlined]
#     [7] test(model_types::Vector{DataType}, data::@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}; mod::Module, level::Int64, throw::Bool, verbosity::Int64)                                                       
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/test.jl:202
#     [8] macro expansion
#       @ REPL[11]:2 [inlined]
#     [9] macro expansion
#       @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
#    [10] top-level scope
#       @ REPL[11]:2
#    [11] eval
#       @ Core ./boot.jl:385 [inlined]
#    [12] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150                                                         
#    [13] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
#    [14] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
#    [15] run_repl(repl::AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::
# Any)                                                                                       
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
#    [16] run_repl(repl::AbstractREPL, consumer::Any)
#       @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
#    [17] (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
#       @ Base ./client.jl:432
#    [18] #invokelatest#2
#       @ Base ./essentials.jl:887 [inlined]
#    [19] invokelatest
#       @ Base ./essentials.jl:884 [inlined]
#    [20] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)                                                                            
#       @ Base ./client.jl:416
#    [21] exec_options(opts::Base.JLOptions)
#       @ Base ./client.jl:333
#    [22] _start()
#       @ Base ./client.jl:552
# Test Summary:              | Error  Total  Time
# generic mlj interface test |     1      1  6.6s
# ERROR: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
 

Example with GaussianMixtureClusterer

Can you please provide a full example with GaussianMixtureClusterer? I tried to instantiate the type, but it is giving me an error saying m is not defined.

This code used to work:

using MLJ: @load

gmm = @load GMMClusterer pkg=BetaML verbosity=0

gmm(K=4)

Now I understand that the new model name is GaussianMixtureClusterer, but the construction is failing.
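For the record, a construction along the following lines should work with the new name; note that n_classes as the replacement for the old K hyper-parameter is my assumption, so check the defaults printed by GMM() for the current field names:

using MLJ

GMM = @load GaussianMixtureClusterer pkg=BetaML verbosity=0
model = GMM(n_classes=4)           # `n_classes` assumed; formerly `K`
mach  = machine(model, table(rand(100, 4))) |> fit!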

FYI: NNlib.jl; depend on it?

Hi,

So far I have only happened to contribute activation functions to your project, and I don't actually use it (or any of the others). I would like to help the one project where it makes the biggest impact, or one central place, and this may be it:

FluxML/NNlib.jl#224

BetaML v0.11.0 Gaussian Mixture Model not compatible with MLJ

I recently updated my packages and noticed that I couldn't create an MLJ machine with the Gaussian Mixture Model with BetaML v0.11.0. The older version v0.10.4 works fine. I have not checked whether this is true for other models in BetaML.

Reproducible example:

julia> using MLJ

julia> GMM = MLJ.@load GaussianMixtureClusterer pkg=BetaML verbosity=0
BetaML.GMM.GaussianMixtureClusterer

julia> machine(GMM(), rand(100, 10))
ERROR: MethodError: no method matching machine(::BetaML.GMM.GaussianMixtureClusterer, ::Matrix{Float64})

Closest candidates are:
  machine(::Type{<:Model}, ::Any...; kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:336
  machine(::Static, ::Any...; cache, kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:340
  machine(::Union{Symbol, Model}, ::Any, ::AbstractNode, ::AbstractNode...; kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:359
  ...

Stacktrace:
 [1] top-level scope
   @ REPL[4]:1

julia> using Pkg

julia> Pkg.status()
Project MLJ_debug v0.1.0
Status `~/tmp/MLJ_debug/Project.toml`
  [024491cd] BetaML v0.11.0
  [add582a8] MLJ v0.20.2

Add MLJ-compliant document strings

We are currently implementing detailed docstrings for all MLJ models, following a standard we have developed. See this issue: JuliaAI/MLJ.jl#913

@sylvaticus If it is helpful to you, @josephsdavid, who is helping us this summer as a GSoD technical writer, can prepare PRs for you to review. David is a working data scientist with some Julia knowledge. You will need to let me know soon if you would like this.

MLJ Interface is not working anymore

The code

modelType = @load RandomForestClassifier pkg="BetaML" verbosity=1
mod = modelType(
    n_trees = 2,
    max_depth = 10
)

is not working in the latest version of BetaML.

initVarainces! doesn't support mixed-type variances

As it is a templated function, it is defined over a single eltype T of the mixtures vector. It needs to be refactored to work with mixed cases (if one really needs different mixture types for the different classes).

Random Forest does not appear to work

I could be doing something I'm not supposed to, but I can't seem to get this to work.

Platform details:

Julia: v1.5.1
BetaML: v0.3.0

Minimal example:

import BetaML

BetaML.Trees.buildForest(rand(100), rand(100))

ERROR: BoundsError: attempt to access (100,) at index [2]
Stacktrace:
 [1] indexed_iterate at .\tuple.jl:81 [inlined]
 [2] buildForest(::Array{Float64,1}, ::Array{Float64,1}, ::Int64; maxDepth::Int64, minGain::Float64, minRecords::Int64, maxFeatures::Int64, splittingCriterion::String, forceClassification::Bool) at C:\[...]\.julia\packages\BetaML\w0Pyx\src\Trees.jl:430
 [3] buildForest(::Array{Float64,1}, ::Array{Float64,1}, ::Int64) at C:\[...]\.julia\packages\BetaML\w0Pyx\src\Trees.jl:429 (repeats 2 times)
 [4] top-level scope at REPL[157]:1
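The immediate trigger seems to be that the first argument must be a feature matrix, given the failure is the destructuring of size(x) into two dimensions. Reshaping to a one-column matrix sidesteps it (a workaround sketch, not a fix):

import BetaML

BetaML.Trees.buildForest(reshape(rand(100), 100, 1), rand(100))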

Scaler() of vectors (instead of matrices) results in errors

julia> fit!(Scaler(),[1,10,100])
ERROR: BoundsError: attempt to access Tuple{Int64} at index [2]
Stacktrace:
 [1] indexed_iterate
   @ ./tuple.jl:88 [inlined]
 [2] _fit(m::StandardScaler, skip::Vector{Int64}, X::Vector{Int64}, cache::Bool)
   @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:645
 [3] fit!(m::Scaler, x::Vector{Int64})
   @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:860
 [4] top-level scope
   @ REPL[17]:1
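The failure is size(X) being destructured into two dimensions, so passing a one-column matrix works as a stopgap (a sketch):

using BetaML

fit!(Scaler(), reshape([1, 10, 100], :, 1))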

Deprecation warning from ProgressMeter.jl

┌ BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
│ ┌ Warning: Progress(n::Integer, dt::Real, desc::AbstractString = "Progress: ", barlen = nothing, color::Symbol = :green, output::IO = stderr; offset::Integer = 0) is deprecated, use Progress(n; dt = dt, desc = desc, barlen = barlen, color = color, output = output, offset = offset) instead.
│ │ caller = ip:0x0
│ └ @ Core :-1

`target_scitype` for MultitargetNeuralNetworkRegressor is too broad

Current scitype:

 target_scitype =
     AbstractVecOrMat{<:Union{ScientificTypesBase.Continuous, ScientificTypesBase.Count}},

which allows a vector as target. But using a vector throws an error:

model = BetaML.Nn.MultitargetNeuralNetworkRegressor();
X, y = make_regression();         # y is vector here
mach = machine(model, X, y)
fit!(mach)
[ Info: Training machine(MultitargetNeuralNetworkRegressor(layers = nothing, ), ).
┌ Error: Problem fitting the machine machine(MultitargetNeuralNetworkRegressor(layers = nothing, ), ). 
└ @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:682
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: The label should have multiple dimensions. Use `NeuralNetworkRegressor` for single-dimensional outputs.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] fit(m::BetaML.Nn.MultitargetNeuralNetworkRegressor, verbosity::Int64, X::Tables.MatrixTable{Matrix{Float64}}, y::Vector{Float64})                                    
   @ BetaML.Nn ~/.julia/packages/BetaML/mWUwE/src/Nn/Nn_MLJ.jl:206
 [3] fit_only!(mach::Machine{BetaML.Nn.MultitargetNeuralNetworkRegressor, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)                                 
   @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:680
 [4] fit_only!
   @ ~/.julia/packages/MLJBase/97P9U/src/machines.jl:606 [inlined]
 [5] #fit!#63
   @ ~/.julia/packages/MLJBase/97P9U/src/machines.jl:778 [inlined]
 [6] fit!(mach::Machine{BetaML.Nn.MultitargetNeuralNetworkRegressor, true})
   @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:775
 [7] top-level scope
   @ REPL[31]:1

One might also want to support tabular y here, which is what other MLJ multitarget models support.
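A narrower declaration would rule out plain vectors, mirroring the error message's advice to reserve 1-D targets for NeuralNetworkRegressor. A sketch (a Table scitype could be added in a union if tabular y gets supported):

target_scitype = AbstractMatrix{<:Union{ScientificTypesBase.Continuous, ScientificTypesBase.Count}}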

Trouble interpolating feature names in a wrapped tree

What am I missing here?

using MLJ
import BetaML.Trees
import DataFrames as DF

table = OpenML.load(42638)
df = DF.select(DF.DataFrame(table), DF.Not(:cabin))

cleaner = FillImputer()
machc = machine(cleaner, df) |> fit!
dfc     =  transform(machc, df)

y, X = unpack(dfc, ==(:survived))

Tree = @load DecisionTreeClassifier pkg=BetaML
tree = Tree(max_depth=3)
mach = machine(tree, X, y) |> fit!

raw_tree = fitted_params(mach).fitresult[1]
wrapped_tree = Trees.wrap(raw_tree, (feature_names=DF.names(X),))

# 2 == female?
# ├─ 1 == 3?
# │  ├─ "1" => 0.5
# │  │  "0" => 0.5
# │  │
# │  └─ "1" => 0.9470588235294117
# │     "0" => 0.052941176470588235
#
# └─ 3 >= 7.0?
#    ├─ "1" => 0.16817359855334538
#    │  "0" => 0.8318264014466547
#
#    └─ "1" => 0.6666666666666666
#       "0" => 0.3333333333333333

cc @roland-KA

Scaler() of Int matrix results in error

julia> using BetaML

julia> fit!(Scaler(),[ 1 10 100; 2 20 200; 3 30 300])
ERROR: InexactError: Int64(-1.224744871391589)
Stacktrace:
  [1] Int64
    @ ./float.jl:900 [inlined]
  [2] convert
    @ ./number.jl:7 [inlined]
  [3] setindex!
    @ ./array.jl:971 [inlined]
  [4] macro expansion
    @ ./multidimensional.jl:932 [inlined]
  [5] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [6] _unsafe_setindex!(::IndexLinear, ::Matrix{Int64}, ::Vector{Float64}, ::Base.Slice{Base.OneTo{Int64}}, ::Int64)
    @ Base ./multidimensional.jl:927
  [7] _setindex!
    @ ./multidimensional.jl:916 [inlined]
  [8] setindex!
    @ ./abstractarray.jl:1397 [inlined]
  [9] _fit(m::StandardScaler, skip::Vector{Int64}, X::Matrix{Int64}, cache::Bool)
    @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:656
 [10] fit!(m::Scaler, x::Matrix{Int64})
    @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:860
 [11] top-level scope
    @ REPL[15]:1
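A stopgap is to convert to floating point before fitting, since the scaled values cannot be written back into an Int matrix (hence the InexactError). A sketch:

using BetaML

fit!(Scaler(), float.([1 10 100; 2 20 200; 3 30 300]))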

WARNING: could not import Perceptron ...

During precompilation I encountered some warnings:

[ Info: Precompiling BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
WARNING: could not import Perceptron.KernelPerceptron into BetaML
WARNING: could not import Perceptron.KernelPerceptronHyperParametersSet into BetaML
WARNING: could not import Perceptron.Pegasos into BetaML
WARNING: could not import Perceptron.PegasosHyperParametersSet into BetaML

Maybe BetaML is importing names from the Perceptron module that no longer exist?

MLJ model docstrings

I notice that the examples in docstrings use the predict and fit from MLJModelInterface (which are not exported by MLJ, and not intended for use by the general MLJ user) rather than the machine-based fit!, predict, etc. methods exported by MLJ. In this respect, these model docstrings differ from all the other MLJ model docstrings, so I'd consider them "non-compliant".

I understand this is some work to correct. Still, it would be great, for uniformity, to have these changed.
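For illustration, the machine-based style the other model docstrings follow looks like this (model choice and data generator are mine, purely illustrative):

using MLJ

X, y = make_blobs(100, 3)
Tree = @load DecisionTreeClassifier pkg=BetaML verbosity=0
mach = machine(Tree(), X, y) |> fit!
yhat = predict(mach, X)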

"`findall` is ambiguous" error

While working with BetaML, DataFrames and Chain, I found that importing BetaML leads to an ambiguity in findall when using the @chain macro.

using Chain, DataFrames

import BetaML as BML

df = DataFrame(randn(100, 3), :auto)

# This works
transform(df, All() => ByRow((x...) -> sum(x)) => :y)

# This fails
@chain df begin
	transform(_, All() => ByRow((x...) -> sum(x)) => :y)
end

I am not sure what the correct solution would be. The error log suggests defining findall(::F, ::Array{T}) where {T, F<:Function}, but I am not experienced in package development, so I am not sure whether there are other things to keep in mind.

Here is the full error log:

LoadError: MethodError: findall(::Chain.var"#4#5", ::Vector{Any}) is ambiguous.

Candidates:
  findall(testf::Function, A)
    @ Base array.jl:2439
  findall(testf::F, A::AbstractArray) where F<:Function
    @ Base array.jl:2447
  findall(el::T, cont::Array{T}; returnTuple) where T
    @ BetaML.Utils ~/.julia/packages/BetaML/QcevM/src/Utils/Processing.jl:73

Possible fix, define
  findall(::F, ::Array{T}) where {T, F<:Function}
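For reference, the disambiguating method the error message proposes could simply forward function predicates to Base's generic implementation. A sketch (whether BetaML should define this, or instead rename its own findall, is a separate question):

Base.findall(testf::F, A::Array{T}) where {T, F<:Function} =
    invoke(findall, Tuple{Function, AbstractArray}, testf, A)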

Cosine distance

Is there not a typo here?

"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))
"""

"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))
"""

I guess it should be:

"""Cosine distance"""
cosine_distance(x,y) = 1 - dot(x,y)/(norm(x)*norm(y))
"""

(if I have understood correctly what you wanted to refer to as "cosine distance")

MLJ model `BetaMLGMMRegressor` predicting row vectors instead of column vectors

using MLJBase
using MLJModels
model = (@iload BetaMLGMMRegressor)()
X, y = make_regression();
mach = machine(model, X, y) |> fit!
yhat = predict(mach, X);

julia> l2(yhat, y)
ERROR: DimensionMismatch: Encountered two objects with sizes (100, 1) and (100,) which needed to match but don't. 
Stacktrace:
 [1] check_dimensions
   @ ~/.julia/packages/MLJBase/CtxrQ/src/utilities.jl:145 [inlined]
 [2] _check(measure::LPLoss{Int64}, yhat::Matrix{Float64}, y::Vector{Float64})
   @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/measures/measures.jl:60
 [3] (::LPLoss{Int64})(::Matrix{Float64}, ::Vararg{Any})
   @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/measures/measures.jl:126
 [4] top-level scope
   @ REPL[36]:1
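Continuing the session above, flattening the (100, 1) prediction matrix sidesteps the mismatch (a user-side workaround, not a fix for the interface):

l2(vec(yhat), y)   # works once `yhat` is a plain vector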
