smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.

Home Page: https://smartcorelib.org/

License: Apache License 2.0

Rust 100.00%
machine-learning machine-learning-algorithms statistical-learning statistical-models rust rust-lang clustering classification regression model-selection

smartcore's People

Contributors

atcol, ckatsak, cmccomb, corebreaker, dependabot-preview[bot], dependabot[bot], ferrouille, gaxler, kiraneiden, mec-is, mlondschien, montanalow, morenol, rabbitrabid, rick68, rnowling, rubdos, ssorc3, titoeb, tushushu, volodymyrorlov, z1queue

smartcore's Issues

Implement OneHotEncoder

Implement OneHotEncoder, make logic similar to Scikit Learn's

The new encoder should belong to new module preprocessing and produce results on par with Scikit Learn
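To make the request concrete, here is a minimal sketch of one-hot encoding for a single categorical column. The struct name, its methods, and the choice of `&str` inputs are assumptions made for illustration, not an existing SmartCore API. Categories are sorted at fit time, which is the behaviour Scikit Learn's encoder is assumed to follow.

```rust
use std::collections::BTreeSet;

/// Hypothetical one-hot encoder for a single categorical column.
struct OneHotEncoder {
    /// Distinct categories seen during `fit`, in sorted order.
    categories: Vec<String>,
}

impl OneHotEncoder {
    /// Learns the sorted set of distinct categories.
    fn fit(values: &[&str]) -> Self {
        let categories = values
            .iter()
            .map(|v| v.to_string())
            .collect::<BTreeSet<_>>()
            .into_iter()
            .collect();
        Self { categories }
    }

    /// Encodes each value as a row with a single 1.0 in its category's column.
    /// Returns `None` if any value was not seen during `fit`.
    fn transform(&self, values: &[&str]) -> Option<Vec<Vec<f64>>> {
        values
            .iter()
            .map(|v| {
                let idx = self.categories.iter().position(|c| c.as_str() == *v)?;
                let mut row = vec![0.0; self.categories.len()];
                row[idx] = 1.0;
                Some(row)
            })
            .collect()
    }
}
```

A real implementation would also have to decide how to treat unknown categories (error versus all-zero row) and how to encode several columns at once.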

Implement hierarchical clustering

Motivation: why do we need hierarchical clustering when we already have k-means?

Vocabulary:

  • divisive clustering: ...
  • agglomerative clustering: average, weighted, median, centroid, Ward

Sub-tasks:

  • pick one or a minimal set of metrics-distances
  • pick one or a minimal set of linkage strategies
  • pick one or more algorithms (SLINK for single-linkage and CLINK for complete-linkage clustering)

Visualisations: (?)

Other implementations:

Add OOB predictions to random forests

I am using in-sample out-of-bag (OOB) predictions to estimate the KL-divergence between samples. In general, OOB predictions are an efficient alternative to CV to estimate out of sample prediction performance and can be used for tuning.

Getting OOB predictions requires storing the samples used to build each tree (i.e. samples here). This could be made optional. We can then add up predictions for samples only that were OOB for a particular tree, keeping track of the number of trees for which a particular sample was OOB.
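The bookkeeping described above can be sketched as follows. This is only an illustration under assumptions: the function name and the representation of in-bag indices and per-tree predictions as plain vectors are hypothetical and do not reflect how SmartCore stores its trees.

```rust
/// Averages per-tree predictions over the trees for which each sample was out-of-bag.
///
/// `in_bag[t]` holds the sample indices drawn (with replacement) to build tree `t`;
/// `tree_predictions[t][i]` is tree `t`'s prediction for sample `i`.
/// Returns `None` for samples that were in-bag for every tree.
fn oob_predictions(
    n_samples: usize,
    in_bag: &[Vec<usize>],
    tree_predictions: &[Vec<f64>],
) -> Vec<Option<f64>> {
    let mut sums = vec![0.0_f64; n_samples];
    let mut counts = vec![0_usize; n_samples];

    for (bag, preds) in in_bag.iter().zip(tree_predictions) {
        // Mark which samples this tree saw during training.
        let mut seen = vec![false; n_samples];
        for &i in bag {
            seen[i] = true;
        }
        // Accumulate predictions only for the samples the tree did not see.
        for i in 0..n_samples {
            if !seen[i] {
                sums[i] += preds[i];
                counts[i] += 1;
            }
        }
    }

    sums.iter()
        .zip(&counts)
        .map(|(&s, &c)| if c > 0 { Some(s / c as f64) } else { None })
        .collect()
}
```

Keeping the per-sample count separate from the sum is what lets the average be taken over a different number of trees for each sample.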

I could work on a PR, but might need some help with details and guidance on what you think the API should be.

question in smartcore/src/svm/svr.rs?

in smartcore/src/svm/svr.rs
line 311: gmin: T::max_value(),
line 312: gmax: T::min_value(),

it looks like the gmin and gmax values are reversed.

it should be like this:
line 311: gmin: T::min_value(),
line 312: gmax: T::max_value(),?

Create changelog

With a changelog, tools like dependabot can report API and dependency changes. We could also create an UPDATE.md where we provide guidance on how to use new features and migrate away from old ones.

We can use this for the previous releases and add an Unreleased section with the changes that are on top of the latest release.

I propose to use the keepachangelog format.

Make SerDe optional

SmartCore depends on serde and serde_derive. Let's put these libraries behind feature flag serde (name is not important)

Implement a new method that predicts probabilities, where it makes sense

Classification algorithms usually offer a way to quantify certainty of a prediction. In Scikit learn a method that returns probability estimates for all classes is called predict_proba.

We need a similar method in SmartCore. One way to do it is to define a new trait, Classifier, with a function predict_proba, and implement this trait for every algorithm where predicting probabilities makes sense.
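One possible shape for such a trait is sketched here. The signatures, the use of nested `Vec<f64>` instead of SmartCore's matrix types, and the `ToyLogistic` model are all assumptions made for illustration.

```rust
/// Hypothetical trait; the names are illustrative, not SmartCore's API.
trait Classifier {
    /// Returns one row per sample; each row holds one probability per class.
    fn predict_proba(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>>;

    /// Default `predict`: the index of the most probable class for each sample.
    fn predict(&self, x: &[Vec<f64>]) -> Vec<usize> {
        self.predict_proba(x)
            .iter()
            .map(|row| {
                row.iter()
                    .enumerate()
                    .fold((0, f64::NEG_INFINITY), |best, (k, &p)| {
                        if p > best.1 { (k, p) } else { best }
                    })
                    .0
            })
            .collect()
    }
}

/// Toy binary logistic model used only to exercise the trait.
struct ToyLogistic {
    weights: Vec<f64>,
    bias: f64,
}

impl Classifier for ToyLogistic {
    fn predict_proba(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>> {
        x.iter()
            .map(|row| {
                let dot: f64 = row.iter().zip(&self.weights).map(|(a, w)| a * w).sum();
                let p1 = 1.0 / (1.0 + (-(dot + self.bias)).exp());
                // Column 0 is class 0, column 1 is class 1.
                vec![1.0 - p1, p1]
            })
            .collect()
    }
}
```

Giving `predict` a default body in terms of `predict_proba` means an algorithm only has to implement the probability estimate to get both methods.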

Feature request: time series functionality

Hello, I hope this finds you well.

Please could you implement machine learning time series functionality within SmartCore? I would like to work with machine learning-based time series forecasting, classification, regression, etc in Rust. I just thought it would be useful and interesting to me for a Rust machine learning library to implement the type of time series functionality in existing Python libraries like sktime, Prophet, and Tensorflow, and that it may be the case for others as well.

There are obviously additional considerations that would take time to address and implement like making sure data are stationary and deseasonised, windowing and framing, etc. It may be useful or necessary to implement models that perform well with time series data, such as LSTM neural networks, classical models, like ARIMA, to use as a baseline against which to compare machine learning models, additional relevant evaluation metrics, etc.

Thank you for your time and consideration.

Test silently raise error

I don't know if this is expected, but when running RUST_BACKTRACE=1 cargo test -- --nocapture, one of the tests prints this backtrace even though the suite succeeds:

test svm::tests::linear_kernel ... ok
test model_selection::tests::run_kfold_return_test_mask_simple ... ok
test svm::tests::rbf_kernel ... ok
test svm::svc::tests::svc_fit_predict_rbf ... ok
test optimization::first_order::gradient_descent::tests::gradient_descent ... ok
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1069
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1427
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:218
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:511
  11: std::panicking::begin_panic
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/libstd/panicking.rs:438
  12: <smartcore::math::distance::minkowski::Minkowski as smartcore::math::distance::Distance<alloc::vec::Vec<T>,T>>::distance
             at src/math/distance/minkowski.rs:43
  13: smartcore::math::distance::minkowski::tests::minkowski_distance_negative_p
             at src/math/distance/minkowski.rs:82
  14: smartcore::math::distance::minkowski::tests::minkowski_distance_negative_p::{{closure}}
             at src/math/distance/minkowski.rs:76
  15: core::ops::function::FnOnce::call_once
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/libcore/ops/function.rs:232
  16: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/liballoc/boxed.rs:1017
  17: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/libstd/panic.rs:318
  18: std::panicking::try::do_call
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/libstd/panicking.rs:331
  19: std::panicking::try
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/libstd/panicking.rs:274
  20: std::panic::catch_unwind
             at /rustc/f509b26a7730d721ef87423a72b3fdf8724b4afa/src/libstd/panic.rs:394
  21: test::run_test_in_process
             at src/libtest/lib.rs:542
  22: test::run_test::run_test_inner::{{closure}}
             at src/libtest/lib.rs:451

No implementation of Display for Dataset

Hi!

First of all: Awesome project!

I found myself wanting to look at a dataset, and implemented this:

fn display_dataset<X: Copy + std::fmt::Debug, Y: Copy + std::fmt::Debug>(dataset: &Dataset<X, Y>) {
    struct Target<Y> {
        name: String,
        value: Y
    }
    struct Feature<X> {
        name: String,
        value: X
    }
    struct DataPoint<X, Y> {
        labels: Vec<Target<Y>>,
        features: Vec<Feature<X>>
    }
    impl<X: Copy + std::fmt::Debug, Y: Copy + std::fmt::Debug> std::fmt::Display for DataPoint<X, Y> {
        fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
            // Write the labels, then the features, as `name:value` pairs into the
            // supplied output stream `f`. Returns `fmt::Result`, which indicates
            // whether the operation succeeded or failed. Note that `write!` uses
            // syntax very similar to `println!`.
            write!(
                f, "{} : {}",
                self.labels.iter().map(|target| format!("{}:{:?}", target.name, target.value)).collect::<String>(),
                self.features.iter().map(|feature| format!("{}:{:?}", feature.name, feature.value)).collect::<String>()
            )
        }
    }
    println!("{}", dataset.description);
    let mut datapoints = Vec::new();
    for sample_index in 0..dataset.num_samples {
        let mut features = Vec::new();
        for feature_index in 0..dataset.feature_names.len() {
            features.push(Feature{
                name: dataset.feature_names[feature_index].to_owned(),
                value: dataset.data[sample_index*dataset.num_features+feature_index]
            });
        }
        let mut targets = Vec::new();
        for target_index in 0..dataset.target_names.len() {
            targets.push(Target{
                name: dataset.target_names[target_index].to_owned(),
                value: dataset.target[sample_index*dataset.target_names.len()+target_index]
            });
        }
        datapoints.push(DataPoint {
            labels: targets,
            features
        })
    }
    for point in datapoints {
        println!("{}", point);
    }
}

Any appetite for a souped-up version of this in a PR?

Naive Bayes (NB) Classifier

Implement Base NB classifier that doesn't make any assumptions about the underlying distribution of x.

https://scikit-learn.org/stable/modules/naive_bayes.html

We need something like this (pseudocode):

trait NBDistribution:
    
    // Fit distribution to some continuous or discrete data
    def fit(x: Matrix<T>) -> NBDistribution
    
    // prior of class k 
    def prior(k) -> T

    // conditional probability of feature j given class k
    def conditional_probability(k, j) -> T

class BaseNaiveBayes:
    
    // "Fits" NB. This method validates and remembers parameters
    def fit(distribution: NBDistribution)
    
    // Calculates likelihood of labels using stored probabilities and X. Returns vector with estimated labels
    def predict(x: Matrix<T>) -> Vector<T>

Once we have BaseNaiveBayes we can implement Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes as concrete implementations of trait NBDistribution
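As a rough Rust rendering of the pseudocode, the sketch below keeps the same split between a distribution trait and a base classifier. Several details are assumptions rather than part of the proposal: the extra `n_classes` method, passing the observed feature value to `conditional_probability`, working with `f64` and nested vectors instead of `Matrix<T>`, summing log-probabilities to avoid underflow, and a `GaussianNB` whose parameters are supplied directly instead of being fitted from data.

```rust
/// Hypothetical Rust rendering of the `NBDistribution` pseudocode.
trait NBDistribution {
    /// Number of classes the distribution knows about.
    fn n_classes(&self) -> usize;
    /// Prior probability of class `k`.
    fn prior(&self, k: usize) -> f64;
    /// Probability (or density) of value `x_j` for feature `j` given class `k`.
    fn conditional_probability(&self, k: usize, j: usize, x_j: f64) -> f64;
}

struct BaseNaiveBayes<D: NBDistribution> {
    distribution: D,
}

impl<D: NBDistribution> BaseNaiveBayes<D> {
    /// "Fits" NB by remembering the supplied distribution.
    fn fit(distribution: D) -> Self {
        Self { distribution }
    }

    /// Picks, for each row, the class with the highest log-posterior score.
    fn predict(&self, x: &[Vec<f64>]) -> Vec<usize> {
        x.iter()
            .map(|row| {
                let mut best = (0, f64::NEG_INFINITY);
                for k in 0..self.distribution.n_classes() {
                    let mut score = self.distribution.prior(k).ln();
                    for (j, &v) in row.iter().enumerate() {
                        score += self.distribution.conditional_probability(k, j, v).ln();
                    }
                    if score > best.1 {
                        best = (k, score);
                    }
                }
                best.0
            })
            .collect()
    }
}

/// Gaussian distribution with per-class, per-feature mean and variance given up front.
struct GaussianNB {
    priors: Vec<f64>,
    means: Vec<Vec<f64>>,
    variances: Vec<Vec<f64>>,
}

impl NBDistribution for GaussianNB {
    fn n_classes(&self) -> usize {
        self.priors.len()
    }
    fn prior(&self, k: usize) -> f64 {
        self.priors[k]
    }
    fn conditional_probability(&self, k: usize, j: usize, x_j: f64) -> f64 {
        let (mu, var) = (self.means[k][j], self.variances[k][j]);
        (-(x_j - mu).powi(2) / (2.0 * var)).exp() / (2.0 * std::f64::consts::PI * var).sqrt()
    }
}
```

Multinomial and Bernoulli variants would then differ only in their `NBDistribution` implementation.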

SVM

Implement Support vector machine (SVM) classifier and regressor.

The theory behind SVM is described in the book "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

This paper (+ references) describes one approach to implementing the Sequential minimal optimization (SMO) algorithm, which can be used to speed up SVR training, while the LaSVM library (and paper) describes a fast implementation of another algorithm for SVC.

SmartCore should support at least these 3 kernels:

GPU backend with Emu

Hi,

This looks neat! I was curious about how hard it would be to implement a GPU-accelerated backend with Emu. Would it amount to implementing the API in linalg?

Add cargo clippy checks

Fix cargo clippy warnings and add to the CI something that checks for that.

There is a cargo clippy --fix command that can solve some of the warnings, that could be used at the beginning.

The instructions to run cargo clippy are:

rustup component add clippy # to install it
cargo clippy

Implement OrdinalEncoder

Implement OrdinalEncoder, make logic similar to Scikit Learn's

The new encoder should belong to new module preprocessing and produce results on par with Scikit Learn

Support for rust_decimal

I want to deploy smartcore without using floating-point numbers. I'll try to change the trait math::num::RealNumber to support rust_decimal.

Do you think it's possible to have something like impl RealNumber for Decimal alongside the RealNumber trait, or would a completely new trait be necessary?

[Ethics] interpretability, accessibility, integration

I am writing down some general concepts here as references or notes that could be helpful to the community.
I suggest reading this paper for an overview of current and possible scenarios for ML applications.

Are interpretability principles something that should be embedded in software libraries or workflows? If yes, how can an API keep software developers from taking shortcuts to black boxes and nudge them toward interpretability-by-design when writing code? Is this desirable?

A missing point in the paper, in my opinion, is that it does not address data quality and dataset growth. There should be an "evolutionary" perspective on model performance as the dataset grows over time, i.e. which characteristics newly added data should have to improve global performance, and how to monitor this (a data-centric MLOps approach).

empty initialized model

I want to have a static variable which holds a model, but I want it to be lazy-loaded. This way, I only have to initialize it once and reuse it during runtime. The final model has been serialized in a file from which it can be loaded.

I'm thinking of something like this:

static mut model: LinearRegression<f64, DenseMatrix<f64>> = None;

fn init(model_path: &str) {
    unsafe {
        model = {
            let mut buf: Vec<u8> = Vec::new();
            File::open(&model_path)
                .and_then(|mut f| f.read_to_end(&mut buf))
                .expect("Can not load model");
            bincode::deserialize(&buf).expect("Can not deserialize the model")
        }
    }
}

But it is not possible to initialize a LinearRegression<f64, DenseMatrix<f64>> with None. Is there another easy way to initialize a "default" or "empty" model? I thought of a constructor-like API:

LinearRegression::new()

which constructs an empty model without any parameters.

Additional info

I think in python's scikit-learn this can also be achieved with sklearn.linear_model.LinearRegression().

I tried an alternative by wrapping the model type with an Option<> like this:

static mut model: Option<LinearRegression<f64, DenseMatrix<f64>>> = None;

but this brings in other challenges and feels a little hacky.
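For comparison, a lazily initialized static can be written without `static mut` or `unsafe` by using the standard library's `std::sync::OnceLock` (stable since Rust 1.70). The sketch below is an assumption-laden illustration: `Model` is a stand-in struct, not a SmartCore type, and `load_model` is a placeholder for reading the file and deserializing it.

```rust
use std::sync::OnceLock;

/// Stand-in for a deserialized model; hypothetical, not a SmartCore type.
struct Model {
    coefficients: Vec<f64>,
    intercept: f64,
}

impl Model {
    fn predict(&self, x: &[f64]) -> f64 {
        self.intercept + x.iter().zip(&self.coefficients).map(|(a, b)| a * b).sum::<f64>()
    }
}

/// Starts empty and is written at most once, on first access.
static MODEL: OnceLock<Model> = OnceLock::new();

/// Placeholder loader; in the setting above this would read the model file
/// and deserialize its bytes.
fn load_model() -> Model {
    Model {
        coefficients: vec![2.0, -1.0],
        intercept: 0.5,
    }
}

/// Returns the shared model, loading it on the first call only.
fn model() -> &'static Model {
    MODEL.get_or_init(load_model)
}
```

Callers then write `model().predict(&features)` anywhere in the program; the loader runs once, even if several threads race on the first call.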

Thanks in Advance!

Implement SGDClassifier

SGDClassifier is one of the few algorithms that can be used for incremental learning and it would be a great addition to SmartCore.

This is an open-ended problem. I do not have many requirements for the specific optimization method or API, other than that the new algorithm should implement the SupervisedEstimator and Predictor interfaces.

Rust machine learning group

I can't find your email address, so I'm opening an issue here. There is a (not yet official) machine learning group for Rust. At the moment we are trying to implement the most popular algorithms, and find a common interface for the learning process. Would be awesome if you say hello and share your project here https://rust-ml.zulipchat.com/

What is planned for v0.2.0?

This is a list of algorithms that are planned for v0.2.0. Feel free to let me know if there is any particular algorithm that you would like to work on or include in the upcoming release.

On top of new algorithms we plan these improvements:

Optional Features (If we have time and spare hands)

Next release is planned for the end of this year. Help needed :)

Refactor linalg module

I want to use this issue to share a heads-up on a big refactoring that I plan for the linalg module.

During the last couple of months I've seen, on multiple occasions, limitations and shortcomings imposed by the current design of BaseVector and BaseMatrix. To mention a few:

  • It is not possible to define an instance of BaseMatrix that holds string or integer values.
  • BaseMatrix is not designed to hold values that belong to multiple types.
  • Some algorithms, e.g. RandomForest, do not use most methods defined in BaseMatrix and BaseVector. Some preprocessing methods that we plan for the future, like LabelEncoder, will not need the linear algebra routines defined for both classes.
  • Some basic operations, like get row or get column, perform an unnecessary copy. This problem stems from the fact that both structs do not provide views or iterators that let a developer access the internal structure of the data.
  • All operations are defined as functions. While this is not a big deal, it leads to clumsy-looking code. Instead, it would be nice to use more of the traits defined in std::ops.

As a result, I'd like to see how we can use Rust's type system to design a better container for data that solves all these shortcomings.

I am open to any suggestions you have. Feel free to post your ideas here.

incremental learning

Hey, thanks for sharing this project!
Is incremental learning supported by smartcore; something like partial_fit from sk-learn?

datasets - deserialize_data mismatched types error

Hi, first of all thanks for your amazing work on bringing ML to Rust!

While compiling a very simple function to train and predict a linear regression model, I encountered an error from the datasets module:

#[wasm_bindgen]
pub fn basic_prediction() -> f64 {
    let x = DenseMatrix::from_2d_array(&[
        &[234.289, 235.6, 159.0, 107.608, 1947., 60.323],
        &[259.426, 232.5, 145.6, 108.632, 1948., 61.122],
        &[258.054, 368.2, 161.6, 109.773, 1949., 60.171],
        &[284.599, 335.1, 165.0, 110.929, 1950., 61.187],
        &[328.975, 209.9, 309.9, 112.075, 1951., 63.221],
        &[346.999, 193.2, 359.4, 113.270, 1952., 63.639],
        &[365.385, 187.0, 354.7, 115.094, 1953., 64.989],
        &[363.112, 357.8, 335.0, 116.219, 1954., 63.761],
        &[397.469, 290.4, 304.8, 117.388, 1955., 66.019],
        &[419.180, 282.2, 285.7, 118.734, 1956., 67.857],
        &[442.769, 293.6, 279.8, 120.445, 1957., 68.169],
        &[444.546, 468.1, 263.7, 121.950, 1958., 66.513],
        &[482.704, 381.3, 255.2, 123.366, 1959., 68.655],
        &[502.601, 393.1, 251.4, 125.368, 1960., 69.564],
        &[518.173, 480.6, 257.2, 127.852, 1961., 69.331],
        &[554.894, 400.7, 282.7, 130.081, 1962., 70.551],
    ]);

    let y: Vec<f64> = vec![
        83.0, 88.5, 88.2, 89.5, 96.2, 98.1, 99.0, 100.0, 101.2, 104.6, 108.4, 110.8, 112.6, 114.2,
        115.7, 116.9,
    ];
    let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2, true);
    let y_hat_lr = LinearRegression::fit(&x_train, &y_train, Default::default())
    .and_then(|lr| lr.predict(&x_test)).unwrap();
    let mse = mean_squared_error(&y_test, &y_hat_lr);

    return mse;
}
direnc@direnc-VirtualBox:~/workspace/nodejs-rust$ wasm-pack build --target nodejs
[INFO]: Checking for the Wasm target...
[INFO]: Compiling to Wasm...
   Compiling smartcore v0.2.0
error[E0308]: mismatched types
  --> /home/direnc/.cargo/registry/src/github.com-1ecc6299db9ec823/smartcore-0.2.0/src/dataset/mod.rs:88:49
   |
88 |         let num_features = usize::from_le_bytes(buffer);
   |                                                 ^^^^^^ expected an array with a fixed size of 4 elements, found one with 8 elements

error[E0308]: mismatched types
  --> /home/direnc/.cargo/registry/src/github.com-1ecc6299db9ec823/smartcore-0.2.0/src/dataset/mod.rs:90:48
   |
90 |         let num_samples = usize::from_le_bytes(buffer);
   |                                                ^^^^^^ expected an array with a fixed size of 4 elements, found one with 8 elements

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0308`.
error: could not compile `smartcore`

To learn more, run the command again with --verbose.
Error: Compiling your crate to WebAssembly failed
Caused by: failed to execute `cargo build`: exited with exit code: 101
  full command: "cargo" "build" "--lib" "--release" "--target" "wasm32-unknown-unknown"

The error occurs in mod.rs inside the deserialize_data function. When changing the buffers from 8 to 4 as shown below, the code builds, but throws a RuntimeError somewhere.

...
 let mut buffer = [0u8; 4];
 buffer.copy_from_slice(&bytes[0..4]);
...
const nodejsrust = require('nodejs-rust')
console.log(nodejsrust.basic_prediction())
wasm://wasm/0003e286:1
RuntimeError: unreachable
    at <anonymous>:wasm-function[56]:0xb419
    at <anonymous>:wasm-function[73]:0xbf1e
    at <anonymous>:wasm-function[128]:0xd073
    at <anonymous>:wasm-function[117]:0xce92
    at <anonymous>:wasm-function[124]:0xcfd6
    at <anonymous>:wasm-function[101]:0xca19
    at <anonymous>:wasm-function[20]:0x89ea
    at <anonymous>:wasm-function[9]:0x68ac
    at <anonymous>:wasm-function[15]:0x7cad
    at basic_prediction (<anonymous>:wasm-function[189]:0xd537)

EDIT1: It was supposed to be fixed with #88 , but when applying the changes locally, the RuntimeError still occurs.
EDIT2:
It seems to be an issue with the wasm-bindings... I tried disabling the dataset feature by specifying smartcore = {version = "0.2.0", default-features = false}, and the function runs via main, but it still panics with the RuntimeError.

Additional Info

I'm compiling to WASM with wasm-pack for nodejs.

[dependencies]
wasm-bindgen = "0.2.63"
smartcore = "0.2.0"

Ubuntu 20.04.2 LTS
cargo 1.51.0 (43b129a20 2021-03-16)
wasm-pack 0.9.1

mod.rs deserialize_data uses architecture specific usize which errors on 32bit linux/windows

pub(crate) fn deserialize_data(
    bytes: &[u8],
) -> Result<(Vec<f32>, Vec<f32>, usize, usize), io::Error> {
    // read the same file back into a Vec of bytes
    let (num_samples, num_features) = {
        let mut buffer = [0u8; 8];
        buffer.copy_from_slice(&bytes[0..8]);
        let num_features = usize::from_le_bytes(buffer); // This line does not compile on 32bit systems. Change to u64
        buffer.copy_from_slice(&bytes[8..16]);
        let num_samples = usize::from_le_bytes(buffer);
        (num_samples, num_features)
    };
  ...
}
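A portable variant of the header parsing might look like the sketch below. It assumes, based on the snippet above, that the file stores both counts as 8-byte little-endian integers. The helper name `read_header` is hypothetical. The counts are read as `u64`, which has the same width on every target, and are then converted to `usize` with an explicit check.

```rust
use std::convert::TryFrom;
use std::io;

/// Reads the two 8-byte little-endian counts at the start of `bytes` as `u64`
/// and converts them to `usize`, failing instead of truncating on 32-bit targets.
fn read_header(bytes: &[u8]) -> Result<(usize, usize), io::Error> {
    if bytes.len() < 16 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "header shorter than 16 bytes",
        ));
    }
    let to_usize = |chunk: &[u8]| -> Result<usize, io::Error> {
        let mut buffer = [0u8; 8];
        buffer.copy_from_slice(chunk);
        usize::try_from(u64::from_le_bytes(buffer))
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
    };
    let num_features = to_usize(&bytes[0..8])?;
    let num_samples = to_usize(&bytes[8..16])?;
    Ok((num_samples, num_features))
}
```

Because the buffer stays 8 bytes wide, the same code compiles for `wasm32-unknown-unknown` and for 64-bit hosts without changing the file format.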

Implement StandardScaler

StandardScaler standardizes features by removing the mean and scaling to unit variance. Implementation details and parameters can be found in Scikit Learn.

The algorithm should be implemented as a struct that extends Transformer. The struct should belong to a new module preprocessing
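As a starting point for discussion, the core arithmetic could be sketched like this. The struct layout and method names are assumptions, SmartCore's Transformer trait is not reproduced here, and nested `Vec<f64>` stands in for the library's matrix types. The sketch uses the population standard deviation and leaves constant columns unscaled, which is the behaviour Scikit Learn documents for its scaler.

```rust
/// Hypothetical sketch of a standard scaler over row-major data.
struct StandardScaler {
    means: Vec<f64>,
    stds: Vec<f64>,
}

impl StandardScaler {
    /// Computes per-column mean and population standard deviation.
    /// Assumes `x` is non-empty and rectangular.
    fn fit(x: &[Vec<f64>]) -> Self {
        let n = x.len() as f64;
        let n_cols = x[0].len();
        let means: Vec<f64> = (0..n_cols)
            .map(|j| x.iter().map(|row| row[j]).sum::<f64>() / n)
            .collect();
        let stds: Vec<f64> = (0..n_cols)
            .map(|j| {
                let var = x.iter().map(|row| (row[j] - means[j]).powi(2)).sum::<f64>() / n;
                // A constant column has zero variance; scale by 1 to avoid dividing by zero.
                if var > 0.0 { var.sqrt() } else { 1.0 }
            })
            .collect();
        Self { means, stds }
    }

    /// Subtracts the fitted mean and divides by the fitted standard deviation.
    fn transform(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>> {
        x.iter()
            .map(|row| {
                row.iter()
                    .enumerate()
                    .map(|(j, &v)| (v - self.means[j]) / self.stds[j])
                    .collect()
            })
            .collect()
    }
}
```

Separating `fit` from `transform` lets the statistics learned on the training set be reused on test data.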

missing implementations to serialize models

I'm trying to serialize a simple LinearRegression<f64, DenseMatrix<f64>> model, which fails on the current development branch:

the trait `serde::ser::Serialize` is not implemented for `smartcore::linear::linear_regression::LinearRegression<f64, smartcore::linalg::naive::dense_matrix::DenseMatrix<f64>>`

This also applies to f32 and occurs for serializing and deserializing. It works on public version 0.2.0.

let model_binary = bincode::serialize(&model).expect("Can not serialize the model");
...
...
bincode::deserialize(&buf).expect("Can not deserialize the model");
...

info

[dependencies]
wasm-bindgen = "=0.2.63"
smartcore = {git = "https://github.com/smartcorelib/smartcore", branch="development"}
serde = "1.0.125"
bincode = "1.3.3"
ssvm-wasi-helper = "=0.1.0"

Implement a generic read_csv method

In many cases, data analysis starts with loading a dataset into memory. Some datasets come as CSV files. We need a new default function read_csv that is defined on the BaseMatrix trait.

This story is not fully defined, and a lot of details should be discussed prior to working on the implementation. For example, I am not sure what parameters (if any) this function should take. Some ideas can be borrowed from the similar function in Pandas.

Thoughts on moving the linalg and math abstractions into a standalone crate?

Hello Smartcore team,

I'm pretty new to the rust language, and have been working on a personal machine learning project to try to learn the language a bit better. In the process of looking for good code examples, I came upon this library, and am impressed with the organization and level of abstraction present in some modules. Particularly, I keep thinking about how useful it would be to develop my personal project against the interface provided by the n-dimensional array/vector/real number abstractions present in the linalg and math modules of this crate.

What are your thoughts on making those modules, linalg and math, part of a standalone crate that Smartcore then depends on? I feel like as those abstractions continue to be refined (e.g. #108), they could become an invaluable part of the ML/AI ecosystem in rust.

If this isn't something you're all interested in, no worries! Just a thought and figured I'd put it out there.

Cheers, from an aspiring rust developer!
-Sean

LASSO regression

Implement LASSO Regression that is similar in functionality to Scikit's Lasso.

We are looking to support following parameters:

  • alpha
  • normalize
  • tol
  • maxIter

This paper describes an optimization method that is comparable to coordinate descent in solving large problems with modest accuracy, but is able to solve them with high accuracy with relatively small additional computational cost.

This method comes with code that can be found here

Release schedule

Is there any release planned for smartcore? I have a crate changeforest that depends on the latest commits implementing seeded & oob random forests and would like to publish this to crates.io. A release of smartcore=0.3.0 would help a lot.

Allow setting seed for `RandomForestClassifier` and `Regressor`

To make them reproducible. This would include passing a RNG to RandomForestClassifier::<T>::sample_with_replacement:

for _ in 0..parameters.n_trees {
let samples = RandomForestClassifier::<T>::sample_with_replacement(&yi, k);

to be used instead of rand::thread_rng():
fn sample_with_replacement(y: &[usize], num_classes: usize) -> Vec<usize> {
let mut rng = rand::thread_rng();

to subsample rows for each tree.

The same RNG would also need to be passed to DecisionTreeClassifier::fit_weak_learner:

let tree = DecisionTreeClassifier::fit_weak_learner(x, y, samples, mtry, params)?;

and then to
if tree.find_best_cutoff(&mut visitor, mtry) {

to be used in this shuffle:
if mtry < n_attr {
variables.shuffle(&mut rand::thread_rng());
}

Same for the RandomForestRegressor.

I can set something up and open a PR.
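The idea of threading one seeded generator through the sampling code can be illustrated without any dependency. In this sketch, `SplitMix64` is a small stand-in for a seeded RNG from the `rand` crate, and the simplified `sample_with_replacement` signature is an assumption that ignores the class handling of the real function.

```rust
/// Minimal SplitMix64 generator, used as a dependency-free stand-in
/// for a seeded RNG from the `rand` crate.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Draws `n` row indices with replacement using the caller's generator and
/// returns how many times each row was drawn.
fn sample_with_replacement(n: usize, rng: &mut SplitMix64) -> Vec<usize> {
    let mut counts = vec![0; n];
    for _ in 0..n {
        // The modulo introduces a slight bias, which is acceptable for a sketch.
        let i = (rng.next_u64() % n as u64) as usize;
        counts[i] += 1;
    }
    counts
}
```

Since the generator is passed in by mutable reference, the same instance can be handed on to the tree-fitting and shuffling steps, so a single seed fixes the whole forest.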

Simple k-fold cross validation

K-fold cross validation (CV) is a preferred way to evaluate the performance of a statistical model. CV is better than just splitting a dataset into training/test sets because we use as many data samples for validation as we can get from a single dataset, thus improving the estimate of out-of-sample error.

SmartCore does not have a method for CV, and this is a shame, because any good ML framework must have it.

I think we could start from a simple replica of the Scikit's sklearn.model_selection.KFold. Later on we can add replica of StratifiedKFold.

If you are not familiar with CV I would start from reading about it here and here. Next I would look at Scikit's implementation and design a function or a class that does the same for SmartCore.

We do not have to reproduce class KFold exactly. One way to do it is to write an iterator that spits out K pairs of (train, test) sets. Also, it might be helpful to see how train/test split is implemented in SmartCore.
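The iterator idea could look roughly like this. The function name `k_fold` and its index-based output are assumptions for illustration. The sketch does not shuffle, and it gives the first `n % k` folds one extra sample, which matches how Scikit's KFold documents its fold sizes.

```rust
/// Yields `k` (train, test) index pairs over `n` samples, without shuffling.
/// The first `n % k` folds receive one extra test sample.
fn k_fold(n: usize, k: usize) -> impl Iterator<Item = (Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n, "k must satisfy 2 <= k <= n");
    let (base, extra) = (n / k, n % k);
    let mut start = 0;
    (0..k).map(move |fold| {
        let size = base + if fold < extra { 1 } else { 0 };
        let end = start + size;
        // The test fold is one contiguous block; everything else is training data.
        let test: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n).collect();
        start = end;
        (train, test)
    })
}
```

Returning indices instead of copied rows keeps the splitter independent of the matrix type, so callers can select rows however their container allows.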
