Allow `verbosity` to be any integer?

I was surprised, when re-running some ecosystem-wide integration tests, to get this message when training MultitargetNeuralNetworkRegressor and NeuralNetworkRegressor using the MLJ interface:

Wrong verbosity level. Verbosity must be either 0, 10, 20, 30 or 40

I was probably using verbosity = -1 to suppress warnings.

I understand the MLJ spec is mostly silent on this, but in practice the rule has been: "With the exception of warnings, training should be silent if verbosity == 0. Lower values should suppress warnings", and I would add "any integer should be allowed".

Perhaps in the MLJ interface for the BetaML models one could map

<= 0 -> 0
1 -> 10
2 -> 20
3 -> 30
>= 4 -> 40

or similar?
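
A minimal sketch of such a mapping, in plain Julia. The helper name `betaml_verbosity` is hypothetical (it is not part of BetaML or MLJ), and sending 4 and above to 40 is an assumption about how the table should be completed:

```julia
# Hypothetical helper: map any integer MLJ verbosity to one of the
# levels BetaML accepts (0, 10, 20, 30 or 40).
function betaml_verbosity(v::Integer)
    v <= 0 && return 0     # silent, including negative "suppress warnings" values
    v >= 4 && return 40    # assumption: everything from 4 upwards is the top level
    return 10 * v          # 1 -> 10, 2 -> 20, 3 -> 30
end

levels = betaml_verbosity.(-1:5)
```

With this, the interface could accept any integer from MLJ and translate it once at the top of fit, so the core BetaML check would never see an out-of-range value.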

Separate into subpackages?

Specifically, I think separating the modules in this into subpackages (i.e. reexported as part of a larger overall BetaML package) would help a lot with discoverability; for instance, the problem I mentioned earlier of people in the stats community having lots of trouble finding the imputation methods here.

Trouble interpolating feature names in a wrapped tree

What am I missing here?

using MLJ
import BetaML.Trees
import DataFrames as DF

table = OpenML.load(42638)
df = DF.select(DF.DataFrame(table), DF.Not(:cabin))

cleaner = FillImputer()
machc = machine(cleaner, df) |> fit!
dfc = transform(machc, df)

y, X = unpack(dfc, ==(:survived))

Tree = @load DecisionTreeClassifier pkg=BetaML
tree = Tree(max_depth=3)
mach = machine(tree, X, y) |> fit!

raw_tree = fitted_params(mach).fitresult[1]
wrapped_tree = Trees.wrap(raw_tree, (feature_names=DF.names(X),))

# 2 == female?
# ├─ 1 == 3?
# │  ├─ "1" => 0.5
# │  │  "0" => 0.5
# │  │
# │  └─ "1" => 0.9470588235294117
# │     "0" => 0.052941176470588235
# └─ 3 >= 7.0?
#    ├─ "1" => 0.16817359855334538
#    │  "0" => 0.8318264014466547
#    └─ "1" => 0.6666666666666666
#       "0" => 0.3333333333333333

cc @roland-KA

initVariances! doesn't support mixed-type variances

As it is a parametric function, it is defined over a single eltype T of the mixtures vector.
It needs to be refactored to work with mixed cases (if one really needs different mixture types for the different classes).

MLJ Interface is not working anymore

The code

modelType = @load RandomForestClassifier pkg = "BetaML" verbosity=1
mod = modelType(
    n_trees = 2,
    max_depth = 10,
)

is not working in the latest version of BetaML.

Error generating MLJ model registry

Running MLJModels.@update to update MLJ's model registry is running into this new error:

ERROR: LoadError: Bad `load_path` trait for BetaML.Imputation.BetaMLGMMImputer: BetaMLGMMImputer not a registered package. 
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ ~/MLJ/MLJModels/src/registry/src/update.jl:122
 [3] eval
   @ ./boot.jl:373 [inlined]
 [4] eval(x::Expr)
   @ Base.MainInclude ./client.jl:453
 [5] _update(mod::Module, test_env_only::Bool)
   @ MLJModels.Registry ~/MLJ/MLJModels/src/registry/src/update.jl:153
 [6] var"@update"(__source__::LineNumberNode, __module__::Module)
   @ MLJModels.Registry ~/MLJ/MLJModels/src/registry/src/update.jl:24
in expression starting at REPL[4]:1

BetaML v0.11.0 Gaussian Mixture Model not compatible with MLJ

I recently updated my packages and noticed that I couldn't create an MLJ machine with the Gaussian Mixture Model with BetaML v0.11.0. The older version v0.10.4 is working fine. I have not checked whether this is true for other models in BetaML.

Reproducible example:

julia> using MLJ

julia> GMM = MLJ.@load GaussianMixtureClusterer pkg=BetaML verbosity=0

julia> machine(GMM(), rand(100, 10))
ERROR: MethodError: no method matching machine(::BetaML.GMM.GaussianMixtureClusterer, ::Matrix{Float64})

Closest candidates are:
  machine(::Type{<:Model}, ::Any...; kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:336
  machine(::Static, ::Any...; cache, kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:340
  machine(::Union{Symbol, Model}, ::Any, ::AbstractNode, ::AbstractNode...; kwargs...)
   @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:359

 [1] top-level scope
   @ REPL[4]:1

julia> using Pkg

julia> Pkg.status()
Project MLJ_debug v0.1.0
Status `~/tmp/MLJ_debug/Project.toml`
  [024491cd] BetaML v0.11.0
  [add582a8] MLJ v0.20.2

`target_scitype` for MultitargetNeuralNetworkRegressor is too broad

Current scitype:

 target_scitype =
     AbstractVecOrMat{<:Union{ScientificTypesBase.Continuous, ScientificTypesBase.Count}},

which allows a vector as target. But using a vector throws an error:

model = BetaML.Nn.MultitargetNeuralNetworkRegressor();
X, y = make_regression();         # y is vector here
mach = machine(model, X, y) |> fit!
[ Info: Training machine(MultitargetNeuralNetworkRegressor(layers = nothing, ), ).
┌ Error: Problem fitting the machine machine(MultitargetNeuralNetworkRegressor(layers = nothing, ), ). 
└ @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:682
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: The label should have multiple dimensions. Use `NeuralNetworkRegressor` for single-dimensional outputs.
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] fit(m::BetaML.Nn.MultitargetNeuralNetworkRegressor, verbosity::Int64, X::Tables.MatrixTable{Matrix{Float64}}, y::Vector{Float64})                                    
   @ BetaML.Nn ~/.julia/packages/BetaML/mWUwE/src/Nn/Nn_MLJ.jl:206
 [3] fit_only!(mach::Machine{BetaML.Nn.MultitargetNeuralNetworkRegressor, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)                                 
   @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:680
 [4] fit_only!
   @ ~/.julia/packages/MLJBase/97P9U/src/machines.jl:606 [inlined]
 [5] #fit!#63
   @ ~/.julia/packages/MLJBase/97P9U/src/machines.jl:778 [inlined]
 [6] fit!(mach::Machine{BetaML.Nn.MultitargetNeuralNetworkRegressor, true})
   @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/machines.jl:775
 [7] top-level scope
   @ REPL[31]:1

One might also want to support tabular y here, which is what other MLJ multitarget models support.

MLJ traits for GMMClusterer

The GMMClusterer is an unsupervised probabilistic model. However we can't check that programmatically because of JuliaAI/MLJModelInterface.jl#120

Is there any fix to make sure that both KMeans and GMMClusterer return a set of categorical values? Right now predict(KMeans(), ...) will return a vector of categorical values whereas predict(GMMClusterer(), ...) will return a vector of distributions.

FYI: NNlib.jl; depend on it?


So far I have only contributed activation functions to your project, not that I use it or any of the others. I would like to help the one project where it makes the biggest impact, or one central place, and this may be it:


MLJ model `BetaMLGMMRegressor` predicting row vectors instead of column vectors

using MLJBase
using MLJModels
model = (@iload BetaMLGMMRegressor)()
X, y = make_regression();
mach = machine(model, X, y) |> fit!
yhat = predict(mach, X);

julia> l2(yhat, y)
ERROR: DimensionMismatch: Encountered two objects with sizes (100, 1) and (100,) which needed to match but don't. 
 [1] check_dimensions
   @ ~/.julia/packages/MLJBase/CtxrQ/src/utilities.jl:145 [inlined]
 [2] _check(measure::LPLoss{Int64}, yhat::Matrix{Float64}, y::Vector{Float64})
   @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/measures/measures.jl:60
 [3] (::LPLoss{Int64})(::Matrix{Float64}, ::Vararg{Any})
   @ MLJBase ~/.julia/packages/MLJBase/CtxrQ/src/measures/measures.jl:126
 [4] top-level scope
   @ REPL[36]:1

WARNING: could not import Perceptron ...

During precompilation I encountered some warnings:

[ Info: Precompiling BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
WARNING: could not import Perceptron.KernelPerceptron into BetaML
WARNING: could not import Perceptron.KernelPerceptronHyperParametersSet into BetaML
WARNING: could not import Perceptron.Pegasos into BetaML
WARNING: could not import Perceptron.PegasosHyperParametersSet into BetaML

Maybe BetaML is importing names from Perceptron module that no longer exist?

Avoid observation-by-observation construction of UnivariateFinite objects in MLJ interface

The overhead for constructing UnivariateFinite objects one at a time is very high. For this reason a UnivariateFiniteArray implementation of AbstractArray{<:UnivariateFinite} was developed. This includes optimised implementations of broadcasting pdf, and so forth.

I recommend that in the BetaML classifiers one construct probabilistic predictions by applying the UnivariateFinite(...) constructor (which can construct arrays as well as singletons) to the full matrix of probabilities (with all observations in it). You can see examples of this in all the MLJ probabilistic classifier interfaces. I am copying the doc-string for this constructor below:

cc @OkonSamuel


Construct a discrete univariate distribution whose finite support is
the elements of the vector support, and whose corresponding
probabilities are elements of the vector probs. Alternatively,
construct an abstract array of UnivariateFinite distributions by
choosing probs to be an array of one higher dimension than the array

Unless pool is specified, support should have type
AbstractVector{<:CategoricalValue} and all elements are assumed to
share the same categorical pool, which may be larger than support.

Important. All levels of the common pool have associated
probabilities, not just those in the specified support. However,
these probabilities are always zero (see example below).

If probs is a matrix, it should have a column for each class in
support (or one less, if augment=true). More generally, probs
will be an array whose size is of the form (n1, n2, ..., nk, c),
where c = length(support) (or one less, if augment=true) and the
constructor then returns an array of size (n1, n2, ..., nk).

using CategoricalArrays
v = categorical([:x, :x, :y, :x, :z])

julia> UnivariateFinite(classes(v), [0.2, 0.3, 0.5])
UnivariateFinite{Multiclass{3}}(x=>0.2, y=>0.3, z=>0.5)

julia> d = UnivariateFinite([v[1], v[end]], [0.1, 0.9])
UnivariateFinite{Multiclass{3}}(x=>0.1, z=>0.9)

julia> rand(d, 3)
3-element Array{Any,1}:
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z
 CategoricalArrays.CategoricalValue{Symbol,UInt32} :z

julia> levels(d)
3-element Array{Symbol,1}:
 :x
 :y
 :z

julia> pdf(d, :y)
0.0

Specifying a pool

Alternatively, support may be a list of raw (non-categorical)
elements if pool is:

  • some CategoricalArray, CategoricalValue or CategoricalPool,
    such that support is a subset of levels(pool)

  • missing, in which case a new categorical pool is created which has
    support as its only levels.

In the last case, specify ordered=true if the pool is to be
considered ordered.

julia> UnivariateFinite([:x, :z], [0.1, 0.9], pool=missing, ordered=true)
UnivariateFinite{OrderedFactor{2}}(x=>0.1, z=>0.9)

julia> d = UnivariateFinite([:x, :z], [0.1, 0.9], pool=v) # v defined above
UnivariateFinite(x=>0.1, z=>0.9) (Multiclass{3} samples)

julia> pdf(d, :y) # allowed as `:y in levels(v)`
0.0

v = categorical([:x, :x, :y, :x, :z, :w])
probs = rand(100, 3)
probs = probs ./ sum(probs, dims=2)
julia> UnivariateFinite([:x, :y, :z], probs, pool=v)
100-element UnivariateFiniteVector{Multiclass{4},Symbol,UInt32,Float64}:
 UnivariateFinite{Multiclass{4}}(x=>0.194, y=>0.3, z=>0.505)
 UnivariateFinite{Multiclass{4}}(x=>0.727, y=>0.234, z=>0.0391)
 UnivariateFinite{Multiclass{4}}(x=>0.674, y=>0.00535, z=>0.321)
 UnivariateFinite{Multiclass{4}}(x=>0.292, y=>0.339, z=>0.369)

Probability augmentation

Unless augment=true, sums of elements along the last axis (row-sums
in the case of a matrix) must be equal to one, and otherwise such an
array is created by inserting appropriate elements ahead of those
provided. This means the provided probabilities are associated with
the classes c2, c3, ..., cn.

UnivariateFinite(prob_given_class; pool=nothing, ordered=false)

Construct a discrete univariate distribution whose finite support is
the set of keys of the provided dictionary, prob_given_class, and
whose values specify the corresponding probabilities.

The type requirements on the keys of the dictionary are the same as
the elements of support given above with this exception: if
non-categorical elements (raw labels) are used as keys, then
pool=... must be specified and cannot be missing.

If the values (probabilities) are arrays instead of scalars, then an
abstract array of UnivariateFinite elements is created, with the
same size as the array.
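
To make the augmentation rule in the docstring concrete, here is a plain-Julia sketch that does not call UnivariateFinite at all. The helper `augment_probs` is a hypothetical name; it only illustrates how the missing first-class column can be recovered from the row sums, under the assumption that each row of the input sums to at most one:

```julia
# Hypothetical illustration of "probability augmentation": the input holds
# probabilities for classes c2, ..., cn only, and the probability of c1 is
# inserted ahead of them so that every row sums to one.
augment_probs(probs::AbstractMatrix) = hcat(1 .- sum(probs, dims=2), probs)

probs = [0.2 0.3;    # observation 1: P(c2), P(c3)
         0.1 0.6]    # observation 2: P(c2), P(c3)

full = augment_probs(probs)   # columns now correspond to c1, c2, c3
```

The point for the BetaML interface is that the whole matrix is handled in one call, rather than one observation at a time.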

The input scitypes for trees are incorrect

From here:

    input_scitype    = MMI.Table(MMI.Missing, MMI.Known),           # also ok: MMI.Table(Union{MMI.Missing, MMI.Known}),

What is written in the comment is correct. What is actually used is not:

julia> X = (; x=[missing, 1, 2])
(x = Union{Missing, Int64}[missing, 1, 2],)

julia> scitype(X) <: Table(Missing, Known)
false

julia> scitype(X) <: Table(Union{Missing, Known})
true

For more on the Table scitype constructor, see here.

All the tree scitypes need changing.

Sorry that I did not pick this up in my review.

Example with GaussianMixtureClusterer

Can you please provide a full example with GaussianMixtureClusterer? I tried to instantiate the type but it is giving me an error saying m is not defined.

This code used to work:

using MLJ: @load

gmm = @load GMMClusterer pkg=BetaML verbosity=0


Now I understand that the new model name is GaussianMixtureClusterer, but the construction is failing.

"`findall` is ambiguous" error

While working with BetaML, DataFrames and Chain, I found that importing BetaML leads to ambiguity in findall when working with the @chain macro.

using Chain, DataFrames

import BetaML as BML

df = DataFrame(randn(100, 3), :auto)

# This works
transform(df, All() => ByRow((x...) -> sum(x)) => :y)

# This fails
@chain df begin
    transform(_, All() => ByRow((x...) -> sum(x)) => :y)
end

I am not sure what the correct solution would be. The error log suggests defining findall(::F, ::Array{T}) where {T, F<:Function}, but I am not experienced in managing packages and therefore not sure if one would have to keep other things in mind.

Here is the full error log:

LoadError: MethodError: findall(::Chain.var"#4#5", ::Vector{Any}) is ambiguous.

  findall(testf::Function, A)
    @ Base array.jl:2439
  findall(testf::F, A::AbstractArray) where F<:Function
    @ Base array.jl:2447
  findall(el::T, cont::Array{T}; returnTuple) where T
    @ BetaML.Utils ~/.julia/packages/BetaML/QcevM/src/Utils/Processing.jl:73

Possible fix, define
  findall(::F, ::Array{T}) where {T, F<:Function}
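
The ambiguity can be reproduced without BetaML or Chain. In the sketch below, `locate` is a hypothetical stand-in for findall, with three methods mirroring the signatures in the log (minus the keyword argument). The last definition follows the fix suggested by the error message. This is only an illustration; whether that is the right fix for BetaML itself is a separate question:

```julia
# `locate` is a hypothetical stand-in for `findall`; signatures mirror the log.
locate(testf::Function, A) = :generic
locate(testf::F, A::AbstractArray) where {F<:Function} = :predicate
locate(el::T, cont::Array{T}) where {T} = :element

# With a Vector{Any}, T = Any also accepts a function in first position,
# so neither the second nor the third method is more specific than the other.
was_ambiguous = try
    locate(isodd, Any[1, 2, 3])
    false
catch err
    err isa MethodError
end

# The definition suggested by the error message breaks the tie.
locate(testf::F, A::Array{T}) where {T, F<:Function} = :predicate

resolved = locate(isodd, Any[1, 2, 3])
```

Note that the problem only shows up for containers with eltype Any, which is presumably why the plain transform call works while the @chain version does not.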

Improve oneHotEncoder stability when encoding integers that represent categories

julia> oneHotEncoder([-1,1,1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [-1]
 [1] setindex!
   @ ./array.jl:903 [inlined]
 [2] oneHotEncoderRow(x::Int64; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:64
 [3] oneHotEncoder(Y::Vector{Int64}; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:127
 [4] oneHotEncoder(Y::Vector{Int64})
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:121
 [5] top-level scope
   @ REPL[5]:1

julia> oneHotEncoder([-1,1,1],factors=[-1,1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [-1]
 [1] setindex!
   @ ./array.jl:903 [inlined]
 [2] oneHotEncoderRow(x::Int64; d::Int64, factors::UnitRange{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:64
 [3] oneHotEncoder(Y::Vector{Int64}; d::Int64, factors::Vector{Int64}, count::Bool)
   @ BetaML.Utils ~/.julia/packages/BetaML/cpTAz/src/Utils/Processing.jl:127
 [4] top-level scope
   @ REPL[6]:1
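
For comparison, here is a sketch of an encoder that looks each value up in the list of factors instead of using the value itself as a column index. This is not BetaML's implementation; the function name `onehot_by_lookup` and the default of sorted unique values are assumptions made for the illustration:

```julia
# Hypothetical sketch: one-hot encode by position in `factors`,
# so negative or non-contiguous integers are handled like any other label.
function onehot_by_lookup(y::AbstractVector; factors = sort(unique(y)))
    index = Dict(f => i for (i, f) in enumerate(factors))  # label -> column
    out = zeros(Int, length(y), length(factors))
    for (row, value) in enumerate(y)
        out[row, index[value]] = 1
    end
    return out
end

encoded = onehot_by_lookup([-1, 1, 1])
explicit = onehot_by_lookup([-1, 1, 1]; factors = [-1, 1])
```

Both calls give a 3×2 matrix with the first column marking -1 and the second marking 1.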

Scaler() of Int matrix results in an error

julia> using BetaML

julia> fit!(Scaler(),[ 1 10 100; 2 20 200; 3 30 300])
ERROR: InexactError: Int64(-1.224744871391589)
  [1] Int64
    @ ./float.jl:900 [inlined]
  [2] convert
    @ ./number.jl:7 [inlined]
  [3] setindex!
    @ ./array.jl:971 [inlined]
  [4] macro expansion
    @ ./multidimensional.jl:932 [inlined]
  [5] macro expansion
    @ ./cartesian.jl:64 [inlined]
  [6] _unsafe_setindex!(::IndexLinear, ::Matrix{Int64}, ::Vector{Float64}, ::Base.Slice{Base.OneTo{Int64}}, ::Int64)
    @ Base ./multidimensional.jl:927
  [7] _setindex!
    @ ./multidimensional.jl:916 [inlined]
  [8] setindex!
    @ ./abstractarray.jl:1397 [inlined]
  [9] _fit(m::StandardScaler, skip::Vector{Int64}, X::Matrix{Int64}, cache::Bool)
    @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:656
 [10] fit!(m::Scaler, x::Matrix{Int64})
    @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:860
 [11] top-level scope
    @ REPL[15]:1
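
The stack trace suggests the scaled Float64 values are being written back into a Matrix{Int64}. A sketch of the usual way around this, promoting to floating point before scaling, is below. The function `standardize` is a hypothetical stand-in, not the BetaML code, and the use of the uncorrected standard deviation is an assumption chosen because it reproduces the value in the error message:

```julia
using Statistics

# Hypothetical sketch: promote to floating point first, so the scaled
# values never have to be converted back to Int.
function standardize(X::AbstractMatrix{<:Real})
    Xf = float.(X)                          # Matrix{Int64} -> Matrix{Float64}
    μ = mean(Xf, dims=1)
    σ = std(Xf, dims=1, corrected=false)    # assumption: population std
    return (Xf .- μ) ./ σ
end

Z = standardize([1 10 100; 2 20 200; 3 30 300])
```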

Rename Nn to NN

Since you were renaming the package only two days ago, should NN also be used (more appropriate as an acronym, like RBF)?

I'm not sure it's advisable to have both (a new module alongside the old). Whether you change it or not, could something like:

module Nn
__init__() = error("do use: using NN")
end

work, or vice versa?

Rename/Alias `GeneralImputer` to `MICE`

The algorithm listed as GeneralImputer here is more widely-known as MICE (Multiple imputation by chained equations) in statistics. I'm not sure if the name used here is standard in ML, but the lack of a solid MICE implementation is a common complaint in the Julia statistics ecosystem, so I was very surprised to stumble across this pure-Julia implementation of MICE under a completely different name. Would it make sense to either rename or alias GeneralImputer to make this easier to discover?

MLJ interface: fit should not mutate model fields

In MLJ learned parameters are distinct from hyper-parameters. A "model" in MLJ is a container for hyper-parameters and that is all.

For this reason, there should be no reason to mutate model fields, and the original API forbade this (unfortunately, this rule seems to have disappeared from the docs: JuliaAI/MLJ.jl#755). Only clean! can mutate the fields, and only if they don't make sense. One exception is that fit may mutate an RNG.

So this is currently non-compliant:

using Pkg
Pkg.add(name="BetaML", rev="master")

using MLJBase
import BetaML

model = BetaML.Clustering.MissingImputator()
mixtures = deepcopy(model.mixtures)

X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38;
     missing -2.3; 5.2 -2.4] |> MLJBase.table

mach = machine(model, X) |> fit!

julia> @assert model.mixtures == mixtures
ERROR: AssertionError: model.mixtures == mixtures
    [1] top-level scope at REPL[40]:1

Maybe one can begin by creating a deepcopy of mixtures and p₀, in this and the related models.
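
A sketch of the pattern I have in mind, using toy stand-ins rather than BetaML types (`ToyImputer` and `toy_fit` are hypothetical names, and the "learning" step is a placeholder):

```julia
# Hypothetical stand-in for a model struct holding hyper-parameters only.
mutable struct ToyImputer
    mixtures::Vector{Vector{Float64}}
end

function toy_fit(model::ToyImputer, X)
    mixtures = deepcopy(model.mixtures)   # work on a copy, never on the model
    shift = sum(skipmissing(X))           # placeholder for the real update rule
    for m in mixtures
        m .+= shift
    end
    return (mixtures = mixtures,)         # learned state lives in the fitresult
end

model = ToyImputer([[0.0], [1.0]])
before = deepcopy(model.mixtures)
fitresult = toy_fit(model, [1.0, missing, 2.0])
```

After the call, the model's fields compare equal to the copy taken before fitting, which is exactly what the assertion in the example above checks.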

Corner case for KernelPerceptronClassifier: unique target class

If only one class is seen in the training data, the model fits okay, but prediction fails. I wonder if this is something that could be supported. Encountered this issue when doing cv for a very small binary classification problem (crabs).

using MLJ

Model = @load KernelPerceptronClassifier

model = Model()

X = (x=rand(10), );

y = coerce(collect("aaaaaaaaaab"), Multiclass)[1:10];

julia> unique(y)
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> levels(y)
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

# works fine:
mach = machine(model, X, y) |> fit!;

# problem:
julia> predict_mode(mach, X)
ERROR: BoundsError: attempt to access 0-element Vector{Matrix{Float64}} at index [1]
 [1] getindex
   @ ./array.jl:861 [inlined]
 [2] predict(x::Matrix{Float64}, xtrain::Vector{Matrix{Float64}}, ytrain::Vector{Vector{Int64}}, α::Vector{Vector{Int64}}, classes::Vector{Char}; K::typeof(BetaML.Utils.radialKernel))
   @ BetaML.Perceptron ~/.julia/packages/BetaML/AeLyL/src/Perceptron/Perceptron.jl:622
 [3] predict(model::BetaML.Perceptron.KernelPerceptronClassifier, fitresult::Tuple{NamedTuple{(:x, :y, :α, :classes, :K), Tuple{Vector{Matrix{Float64}}, Vector{Vector{Int64}}, Vector{Vector{Int64}}, Vector{Char}, typeof(BetaML.Utils.radialKernel)}}, Vector{Char}}, Xnew::NamedTuple{(:x,), Tuple{Vector{Float64}}})
   @ BetaML.Perceptron ~/.julia/packages/BetaML/AeLyL/src/Perceptron/Perceptron_MLJ.jl:137
 [4] predict_mode(m::BetaML.Perceptron.KernelPerceptronClassifier, fitresult::Tuple{NamedTuple{(:x, :y, :α, :classes, :K), Tuple{Vector{Matrix{Float64}}, Vector{Vector{Int64}}, Vector{Vector{Int64}}, Vector{Char}, typeof(BetaML.Utils.radialKernel)}}, Vector{Char}}, Xnew::NamedTuple{(:x,), Tuple{Vector{Float64}}})
   @ MLJBase ~/MLJ/MLJBase/src/interface/model_api.jl:11
 [5] predict_mode(mach::Machine{BetaML.Perceptron.KernelPerceptronClassifier, true}, Xraw::NamedTuple{(:x,), Tuple{Vector{Float64}}})                                                  
   @ MLJBase ~/MLJ/MLJBase/src/operations.jl:85
 [6] top-level scope
   @ REPL[39]:1

MLJ model docstrings

I notice that examples in docstrings use the predict and fit from MLJModelInterface (which are not exported by MLJ, and not intended for use by general MLJ users) rather than the machine fit!, predict, etc. methods exported by MLJ. In this respect, these model docstrings differ from all the other MLJ model docstrings, so I'd consider them "non-compliant".

I understand this is some work to correct. Still, it would be great, for uniformity, to have these changed.

Cosine distance

Is there not a typing error here?

"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))

I guess it should be:

"""Cosine distance"""
cosine_distance(x,y) = 1 - dot(x,y)/(norm(x)*norm(y))

(if I understood correctly what you meant by "cosine distance")
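
A quick self-contained check of the difference, assuming the usual definition of cosine distance as one minus cosine similarity (the name `cosine_similarity` is introduced here only for the illustration):

```julia
using LinearAlgebra

# What the current definition computes: a similarity, 1 for parallel vectors.
cosine_similarity(x, y) = dot(x, y) / (norm(x) * norm(y))

# The proposed definition: a distance, 0 for parallel vectors.
cosine_distance(x, y) = 1 - cosine_similarity(x, y)

d_parallel   = cosine_distance([1.0, 0.0], [2.0, 0.0])
d_orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
d_opposite   = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

With the current definition, two identical vectors would be reported as being at "distance" 1, which is what made me suspect a typo.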

Add MLJ-compliant document strings

We are currently implementing detailed docstrings for all MLJ models, following a standard we have developed. See this issue: JuliaAI/MLJ.jl#913

@sylvaticus If it is helpful to you, @josephsdavid, who is helping us this summer as GSoD technical writer can prepare PRs for you to review. David is a working data scientist with some Julia knowledge. You will need to let me know soon if you would like this.

Scaler() of vectors (instead of matrices) results in errors

julia> fit!(Scaler(),[1,10,100])
ERROR: BoundsError: attempt to access Tuple{Int64} at index [2]
 [1] indexed_iterate
   @ ./tuple.jl:88 [inlined]
 [2] _fit(m::StandardScaler, skip::Vector{Int64}, X::Vector{Int64}, cache::Bool)
   @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:645
 [3] fit!(m::Scaler, x::Vector{Int64})
   @ BetaML.Utils ~/.julia/dev/BetaML/src/Utils/Processing.jl:860
 [4] top-level scope
   @ REPL[17]:1
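
The BoundsError on Tuple{Int64} looks like the result of destructuring size(X) into two values when X is a vector. One possible way to handle this, sketched with a hypothetical helper `as_matrix` (not part of BetaML), is to treat a vector as a single-column matrix before fitting:

```julia
# Hypothetical helper: view a vector as an n×1 matrix, leave matrices alone.
as_matrix(x::AbstractVector) = reshape(x, :, 1)
as_matrix(X::AbstractMatrix) = X

col = as_matrix([1, 10, 100])
n, d = size(col)   # destructuring now works: n == 3, d == 1
```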

MLJ interface for `KernelPerceptronClassifier` is not tracking all target levels

julia> using MLJ

julia> Model = @load KernelPerceptronClassifier
[ Info: For silent loading, specify `verbosity=0`. 
import BetaML ✔

julia> model = Model()
KernelPerceptronClassifier(
  K = BetaML.Utils.radialKernel, 
  maxEpochs = 100, 
  initialα = Int64[], 
  shuffle = false, 
  rng = Random._GLOBAL_RNG())

julia> X = (x=rand(10), );

julia> y = coerce(collect("abababababcc"), Multiclass)[1:10];

julia> unique(y)
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

julia> levels(y)
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> mach = machine(model, X, y) |> fit!;
[ Info: Training machine(KernelPerceptronClassifier(K = radialKernel, ), ).

julia> predict_mode(mach, X) |> levels
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

That last output indicates a bug, as all levels in the pool of the training vector should be present in the pool of the predictions.

Curiously in other classifiers I looked at, the levels are indeed being tracked correctly. So perhaps have a look at, eg, the BetaML DecisionTreeClassifier to see how this can be corrected.

This bug is causing a failure when the model is bagged in an ensemble using EnsembleModel because some classes are not present in some of the bagged observations, but are present in others.

Deprecation warning from ProgressMeter.jl

┌ BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b]
│ ┌ Warning: Progress(n::Integer, dt::Real, desc::AbstractString = "Progress: ", barlen = nothing, color::Symbol = :green, output::IO = stderr; offset::Integer = 0) is deprecated, use Progress(n; dt = dt, desc = desc, barlen = barlen, color = color, output = output, offset = offset) instead.
│ │ caller = ip:0x0
│ └ @ Core :-1

Error during precompilation (ERROR: LoadError: InitError: Evaluation into the closed module `Perceptron` ...)

I wanted to use BetaML.jl in a project, however when I try doing so I get the following error:

julia> using Foo
[ Info: Precompiling Foo [4817f03b-69bd-4595-9d0a-a711fd8a192f]
ERROR: LoadError: InitError: Evaluation into the closed module `Perceptron` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `Perceptron` with `eval` during precompilation - don't do this.
  [1] eval
    @ ./boot.jl:368 [inlined]
  [2] eval(x::Expr)
    @ BetaML.Perceptron ~/.julia/packages/BetaML/mqBvh/src/Perceptron/Perceptron.jl:19
  [3] metadata_pkg(T::Type; name::String, uuid::String, url::String, julia::Bool, license::String, is_wrapper::Bool, package_name::String, package_uuid::String, package_url::String, is_pure_julia::Bool, package_license::String)
    @ MLJModelInterface ~/.julia/packages/MLJModelInterface/wwFA9/src/metadata_utils.jl:54
  [4] #41
    @ ./broadcast.jl:1284 [inlined]
  [5] _broadcast_getindex_evalf
    @ ./broadcast.jl:670 [inlined]
  [6] _broadcast_getindex
    @ ./broadcast.jl:643 [inlined]
  [7] #29
    @ ./broadcast.jl:1075 [inlined]
  [8] macro expansion
    @ ./ntuple.jl:74 [inlined]
  [9] ntuple
    @ ./ntuple.jl:69 [inlined]
 [10] copy
    @ ./broadcast.jl:1075 [inlined]
 [11] materialize
    @ ./broadcast.jl:860 [inlined]
 [12] __init__()
    @ BetaML ~/.julia/packages/BetaML/mqBvh/src/BetaML.jl:63
 [13] _include_from_serialized(pkg::Base.PkgId, path::String, depmods::Vector{Any})
    @ Base ./loading.jl:831
 [14] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt64)
    @ Base ./loading.jl:1039
 [15] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1315
 [16] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
 [17] macro expansion
    @ ./loading.jl:1180 [inlined]
 [18] macro expansion
    @ ./lock.jl:223 [inlined]
 [19] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144
 [20] include
    @ ./Base.jl:419 [inlined]
 [21] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1554
 [22] top-level scope
    @ stdin:1
during initialization of module BetaML
in expression starting at /data_temp/picaud/Temp/Beta/Foo.jl/src/Foo.jl:1
in expression starting at stdin:1
ERROR: Failed to precompile Foo [4817f03b-69bd-4595-9d0a-a711fd8a192f] to /home/picaud/.julia/compiled/v1.8/Foo/jl_a1tr7Z.
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
   @ Base ./loading.jl:1707
 [3] compilecache
   @ ./loading.jl:1651 [inlined]
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1337
 [5] _require_prelocked(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1200
 [6] macro expansion
   @ ./loading.jl:1180 [inlined]
 [7] macro expansion
   @ ./lock.jl:223 [inlined]
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1144

The error is not present when I remove precompilation. The BetaML.jl "patch" is:

# function __init__()
#     MMI.metadata_pkg.(MLJ_INTERFACED_MODELS,
#         name       = "BetaML",
#         uuid       = "024491cd-cc6b-443e-8034-08ea7eb7db2b",     # see your Project.toml
#         url        = "",  # URL to your package repo
#         julia      = true,     # is it written entirely in Julia?
#         license    = "MIT",    # your package license
#         is_wrapper = false,    # does it wrap around some other package?
#     )
# end

Steps to reproduce:

Create a local package Foo (in /tmp/, for example)

cd /tmp

(@v1.8) pkg> generate Foo.jl
  Generating  project Foo:

(@v1.8) pkg> activate ./Foo.jl/
  Activating project at `/tmp/Foo.jl`

(Foo) pkg> add BetaML

(Foo) pkg> activate 
  Activating project at `~/.julia/environments/v1.8`

(@v1.8) pkg> dev ./Foo.jl/
   Resolving package versions...

Then modify Foo.jl as follows:

module Foo

using BetaML # <---- here

greet() = print("Hello World!")

end # module Foo

Then from Julia type

julia> using Foo

and I (and maybe you) will get the error I mentioned at the beginning.


Problem with MLJ interface for KMedoidsClusterer

import BetaML
using MLJTestInterface

@testset "generic mlj interface test" begin
    failures, summary = MLJTestInterface.test(
        verbosity=0, # bump to debug
        throw=true, # set to true to debug (`false` in CI)
    )
    @test isempty(failures)
end

# generic mlj interface test: Error During Test at REPL[11]:1
#   Got exception outside of a @test
#   UndefVarError: `fitresults` not defined
#   Stacktrace:
#     [1] attempt(f::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}}, message::String; throw::Bool)

< parts omitted for clarity >
#   caused by: UndefVarError: `fitresults` not defined
#   Stacktrace:
#     [1] fitted_params(model::BetaML.Bmlj.KMeansClusterer, fitresult::@NamedTuple{classes::Vector{Int64}, centers::Matrix{Float64}, distanceFunction::BetaML.Bmlj.var"#13#15"})
#       @ BetaML.Bmlj ~/.julia/packages/BetaML/SPPMQ/src/Bmlj/Clustering_mlj.jl:175
#     [2] fitted_params(mach::MLJBase.Machine{BetaML.Bmlj.KMeansClusterer, true})
#       @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:820
#     [3] (::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}})()
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:85
#     [4] attempt(f::MLJTestInterface.var"#9#10"{BetaML.Bmlj.KMeansClusterer, Tuple{@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}}, message::String; throw::Bool)
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:15
#     [5] #fitted_machine#8
#       @ ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:77 [inlined]
#     [6] fitted_machine
#       @ ~/.julia/packages/MLJTestInterface/6i2JH/src/attemptors.jl:75 [inlined]
#     [7] test(model_types::Vector{DataType}, data::@NamedTuple{Rm::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, LStat::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}; mod::Module, level::Int64, throw::Bool, verbosity::Int64)                                                       
#       @ MLJTestInterface ~/.julia/packages/MLJTestInterface/6i2JH/src/test.jl:202
#     [8] macro expansion
#       @ REPL[11]:2 [inlined]
#     [9] macro expansion
#       @ /Applications/ [inlined]
#    [10] top-level scope
#       @ REPL[11]:2
#    [11] eval
#       @ Core ./boot.jl:385 [inlined]
#    [12] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
#       @ REPL /Applications/                                                         
#    [13] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
#       @ REPL /Applications/
#    [14] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
#       @ REPL /Applications/
#    [15] run_repl(repl::AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::
# Any)                                                                                       
#       @ REPL /Applications/
#    [16] run_repl(repl::AbstractREPL, consumer::Any)
#       @ REPL /Applications/
#    [17] (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
#       @ Base ./client.jl:432
#    [18] #invokelatest#2
#       @ Base ./essentials.jl:887 [inlined]
#    [19] invokelatest
#       @ Base ./essentials.jl:884 [inlined]
#    [20] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)                                                                            
#       @ Base ./client.jl:416
#    [21] exec_options(opts::Base.JLOptions)
#       @ Base ./client.jl:333
#    [22] _start()
#       @ Base ./client.jl:552
# Test Summary:              | Error  Total  Time
# generic mlj interface test |     1      1  6.6s
# ERROR: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.

Random Forest does not appear to work

Could be doing something I'm not supposed to, but I can't seem to get this to work.

Platform details:

Julia: v1.5.1
BetaML: v0.3.0

Minimum example:

import BetaML

BetaML.Trees.buildForest(rand(100), rand(100))

ERROR: BoundsError: attempt to access (100,)
  at index [2]
 [1] indexed_iterate at .\tuple.jl:81 [inlined]
 [2] buildForest(::Array{Float64,1}, ::Array{Float64,1}, ::Int64; maxDepth::Int64, minGain::Float64, minRecords::Int64, maxFeatures::Int64, splittingCriterion::String, forceClassification::Bool) at C:\[...]\.julia\packages\BetaML\w0Pyx\src\Trees.jl:430
 [3] buildForest(::Array{Float64,1}, ::Array{Float64,1}, ::Int64) at C:\[...]\.julia\packages\BetaML\w0Pyx\src\Trees.jl:429 (repeats 2 times)
 [4] top-level scope at REPL[157]:1

