yoshoku / rumale

Rumale is a machine learning library in Ruby

Home Page: https://rubygems.org/gems/rumale

License: BSD 3-Clause "New" or "Revised" License

Shell 0.29% JavaScript 0.01% Ruby 97.60% C++ 2.11%

ruby machine-learning data-science data-analysis artificial-intelligence rubyml ml

rumale's Introduction

Rumale

Rumale (Ruby machine learning) is a machine learning library in Ruby. Rumale provides machine learning algorithms with interfaces similar to Scikit-Learn in Python. Rumale supports Support Vector Machine, Logistic Regression, Ridge, Lasso, Multi-layer Perceptron, Naive Bayes, Decision Tree, Gradient Tree Boosting, Random Forest, K-Means, Gaussian Mixture Model, DBSCAN, Spectral Clustering, Multidimensional Scaling, t-SNE, Fisher Discriminant Analysis, Neighbourhood Component Analysis, Principal Component Analysis, Non-negative Matrix Factorization, and many other algorithms.

Installation

Add this line to your application's Gemfile:

gem 'rumale'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rumale

Documentation

Rumale API Documentation

Usage

Example 1. Pendigits dataset classification

Rumale provides a function for loading dataset files in libsvm format. We start by downloading the pendigits dataset from the LIBSVM Data web site.

$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/pendigits.t

The following code trains a classifier using a linear SVM with an RBF kernel feature map.

require 'rumale'

# Load the training dataset.
samples, labels = Rumale::Dataset.load_libsvm_file('pendigits')

# Map training data to RBF kernel feature space.
transformer = Rumale::KernelApproximation::RBF.new(gamma: 0.0001, n_components: 1024, random_seed: 1)
transformed = transformer.fit_transform(samples)

# Train linear SVM classifier.
classifier = Rumale::LinearModel::SVC.new(reg_param: 0.0001)
classifier.fit(transformed, labels)

# Save the model.
File.open('transformer.dat', 'wb') { |f| f.write(Marshal.dump(transformer)) }
File.open('classifier.dat', 'wb') { |f| f.write(Marshal.dump(classifier)) }

The following code classifies the testing data with the trained classifier.

require 'rumale'

# Load the testing dataset.
samples, labels = Rumale::Dataset.load_libsvm_file('pendigits.t')

# Load the model.
transformer = Marshal.load(File.binread('transformer.dat'))
classifier = Marshal.load(File.binread('classifier.dat'))

# Map testing data to RBF kernel feature space.
transformed = transformer.transform(samples)

# Classify the testing data and evaluate prediction results.
puts("Accuracy: %.1f%%" % (100.0 * classifier.score(transformed, labels)))

# Other evaluating approach
# results = classifier.predict(transformed)
# evaluator = Rumale::EvaluationMeasure::Accuracy.new
# puts("Accuracy: %.1f%%" % (100.0 * evaluator.score(results, labels)))

Executing the above scripts produces the following output.

$ ruby train.rb
$ ruby test.rb
Accuracy: 98.5%

Example 2. Cross-validation

require 'rumale'

# Load dataset.
samples, labels = Rumale::Dataset.load_libsvm_file('pendigits')

# Define the estimator to be evaluated.
lr = Rumale::LinearModel::LogisticRegression.new

# Define the evaluation measure, splitting strategy, and cross validation.
ev = Rumale::EvaluationMeasure::Accuracy.new
kf = Rumale::ModelSelection::StratifiedKFold.new(n_splits: 5, shuffle: true, random_seed: 1)
cv = Rumale::ModelSelection::CrossValidation.new(estimator: lr, splitter: kf, evaluator: ev)

# Perform 5-fold cross validation.
report = cv.perform(samples, labels)

# Output result.
mean_accuracy = report[:test_score].sum / kf.n_splits
puts "5-CV mean accuracy: %.1f%%" % (100.0 * mean_accuracy)

Executing the above script produces the following output.

$ ruby cross_validation.rb
5-CV mean accuracy: 95.5%
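To make the splitting strategy concrete, here is a minimal pure-Ruby sketch of stratified K-fold index generation. It is an illustration only and not Rumale's implementation: the helper name stratified_kfold_indices and its parameters are assumptions made for this example.

```ruby
# Hypothetical helper (not part of Rumale): split sample indices into
# n_splits folds while keeping class proportions roughly equal per fold.
def stratified_kfold_indices(labels, n_splits:, seed: 1)
  rng = Random.new(seed)
  folds = Array.new(n_splits) { [] }
  # Group indices by class, shuffle inside each class, then deal them
  # out to the folds round-robin.
  labels.each_index.group_by { |i| labels[i] }.each_value do |indices|
    indices.shuffle(random: rng).each_with_index do |idx, pos|
      folds[pos % n_splits] << idx
    end
  end
  # Each fold serves once as the test set; the rest form the training set.
  folds.map do |test_ids|
    train_ids = (0...labels.size).to_a - test_ids
    [train_ids.sort, test_ids.sort]
  end
end

labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
splits = stratified_kfold_indices(labels, n_splits: 2)
splits.each do |train_ids, test_ids|
  puts "train=#{train_ids.inspect} test=#{test_ids.inspect}"
end
```

With four samples of class 0 and six of class 1, each of the two test folds receives two samples of class 0 and three of class 1, which is the property that stratification is meant to preserve.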

Speedup

Numo::Linalg

Rumale uses Numo::NArray for typed arrays. Loading Numo::Linalg allows matrix and vector products of Numo::NArray to be computed with BLAS libraries. Because some machine learning algorithms compute matrix and vector products frequently, those algorithms can be expected to run faster.

Install Numo::Linalg gem.

$ gem install numo-linalg

In a Ruby script, just load Numo::Linalg along with Rumale.

require 'numo/linalg/autoloader'
require 'rumale'

Numo::Linalg lets the user select the backend library for BLAS/LAPACK. Instead of fixing the backend library yourself, you can use Numo::OpenBLAS or Numo::BLIS to simplify installation.

Numo::TinyLinalg

Numo::TinyLinalg is a subset of Numo::Linalg consisting only of the methods used in machine learning algorithms. Numo::TinyLinalg only supports OpenBLAS as a backend library for BLAS and LAPACK. If the OpenBLAS library is not found during installation, Numo::TinyLinalg downloads and builds it.

$ gem install numo-tiny_linalg

Load Numo::TinyLinalg instead of Numo::Linalg.

require 'numo/tiny_linalg'

Numo::Linalg = Numo::TinyLinalg

require 'rumale'

Parallel

Several estimators in Rumale support parallel processing. Parallel processing in Rumale is implemented with the Parallel gem, so install and load it.

$ gem install parallel

require 'parallel'
require 'rumale'

Estimators that support parallel processing have an n_jobs parameter. When n_jobs is set to -1, all processors are used.

estimator = Rumale::Ensemble::RandomForestClassifier.new(n_jobs: -1, random_seed: 1)

Related Projects

Rumale::SVM provides the support vector machine algorithms of LIBSVM and LIBLINEAR with the Rumale interface.
Rumale::Torch provides learning and inference for neural networks defined in torch.rb with the Rumale interface.

License

The gem is available as open source under the terms of the BSD-3-Clause License.

rumale's Issues

When loading dataset from libsvm file, determine size of line

Hi,
In the libsvm MNIST file, zero values are skipped, so the number of columns inferred for the samples is not correct.
My suggestion is to add an option for specifying the number of features when loading a libsvm file.

2.7.1 :004 > x, y = Rumale::Dataset.load_libsvm_file("mnist", zero_based: true)

2.7.1 :005 > 
2.7.1 :006 > y
 => 
Numo::Int32#shape=[60000]
[5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4, 0, 9, 1, 1, ...] 
2.7.1 :007 > x
 => 
Numo::DFloat#shape=[60000,781]

My desired shape of x is [60000,784].

If anyone knows how I can do this, I'd be happy to hear it.
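The shape mismatch comes from the libsvm format itself: each line lists only the non-zero features as index:value pairs, so a loader that infers the width from the largest index seen will undercount when trailing features are always zero. The sketch below shows the idea of padding every row to a caller-chosen width. It is a pure-Ruby illustration, not Rumale's loader; the helper name parse_libsvm and the n_features keyword are assumptions for this example.

```ruby
# Hypothetical helper (not Rumale's API): parse libsvm-format text into a
# dense feature matrix with a caller-chosen number of columns.
def parse_libsvm(text, n_features:, zero_based: false)
  labels = []
  samples = []
  text.each_line do |line|
    tokens = line.split
    next if tokens.empty?

    labels << Integer(tokens.shift)
    # Start from an all-zero row so that skipped features stay 0.0.
    row = Array.new(n_features, 0.0)
    tokens.each do |token|
      index, value = token.split(':')
      column = zero_based ? Integer(index) : Integer(index) - 1
      unless (0...n_features).cover?(column)
        raise ArgumentError, "feature index #{index} out of range"
      end

      row[column] = Float(value)
    end
    samples << row
  end
  [samples, labels]
end

text = <<~DATA
  5 1:0.5 3:1.0
  0 2:0.25
DATA
samples, labels = parse_libsvm(text, n_features: 5)
p labels
p samples
```

Both rows come back with five columns even though the largest index in the data is 3. The nested Ruby Array can then be converted to an NArray, for example with Numo::DFloat.cast.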

FP-Growth algorithm

May I try adding FP-Growth, an algorithm for the association task, to this repo?

Unable to install gem under rvm on jruby-9.2.0.0

Running gem install rumale fails with this error:

Building native extensions. This could take a while...
ERROR:  Error installing rumale:
        ERROR: Failed to build gem native extension.

    current directory: /Users/kietdv/.rvm/gems/jruby-9.2.0.0/gems/numo-narray-0.9.1.6/ext/numo/narray
/Users/kietdv/.rvm/rubies/jruby-9.2.0.0/bin/jruby -r ./siteconf20200222-38612-w78zw1.rb extconf.rb
checking for stdbool.h... RuntimeError: The compiler failed to generate an executable file.
You have to install development tools first.

                 try_do at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:456
                try_cpp at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:587
   block in have_header at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:1091
  block in checking_for at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:942
      block in postpone at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:350
                   open at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:320
      block in postpone at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:350
                   open at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:320
               postpone at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:346
           checking_for at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:941
            have_header at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:1090
                 <main> at extconf.rb:60
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers.  Check the mkmf.log file for more details.  You may
need configuration options.

Provided configuration options:
        --with-opt-dir
        --without-opt-dir
        --with-opt-include
        --without-opt-include=${opt-dir}/include
        --with-opt-lib
        --without-opt-lib=${opt-dir}/lib
        --with-make-prog
        --without-make-prog
        --srcdir=.
        --curdir
        --ruby=/Users/kietdv/.rvm/rubies/jruby-9.2.0.0/bin/jruby

To see why this extension failed to compile, please check the mkmf.log which can be found here:

  /Users/kietdv/.rvm/gems/jruby-9.2.0.0/extensions/universal-java-1.8/2.5.0/numo-narray-0.9.1.6/mkmf.log

extconf failed, exit code 1

Gem files will remain installed in /Users/kietdv/.rvm/gems/jruby-9.2.0.0/gems/numo-narray-0.9.1.6 for inspection.
Results logged to /Users/kietdv/.rvm/gems/jruby-9.2.0.0/extensions/universal-java-1.8/2.5.0/numo-narray-0.9.1.6/gem_make.out

feature request: clustering metrics

Hello,
First, congrats on this wonderful project!
I'm using some of your clustering algorithms and I was wondering if you plan to add some basic metrics such as the Silhouette Coefficient, the Calinski-Harabasz Index and the Davies-Bouldin Index. According to https://scikit-learn.org/stable/modules/clustering.html they would help in choosing the number of clusters when we have no clue about that number.
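As an illustration of the first metric mentioned, here is a minimal pure-Ruby sketch of the mean Silhouette Coefficient using Euclidean distance. It is not a Rumale API; the helper names euclidean and silhouette_score are assumptions for this example, and the quadratic-time loops are only suitable for small datasets.

```ruby
# Hypothetical helpers: mean silhouette coefficient with Euclidean distance.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |p, q| (p - q)**2 })
end

def silhouette_score(points, cluster_ids)
  clusters = points.each_index.group_by { |i| cluster_ids[i] }
  raise ArgumentError, 'need at least two clusters' if clusters.size < 2

  scores = points.each_index.map do |i|
    own = clusters[cluster_ids[i]] - [i]
    next 0.0 if own.empty? # convention: singleton clusters score 0

    # a: mean distance to the other members of the same cluster.
    a = own.sum { |j| euclidean(points[i], points[j]) } / own.size
    # b: smallest mean distance to the members of any other cluster.
    others = clusters.reject { |id, _| id == cluster_ids[i] }.values
    b = others.map { |members| members.sum { |j| euclidean(points[i], points[j]) } / members.size }.min
    (b - a) / [a, b].max
  end
  scores.sum / scores.size
end

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
puts silhouette_score(points, [0, 0, 1, 1]) # well-separated assignment
puts silhouette_score(points, [0, 1, 0, 1]) # poor assignment
```

The well-separated assignment scores close to 1 and the poor assignment scores below 0, so comparing the score across candidate cluster counts is one way to choose the number of clusters.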

Online versions of algorithms

First, thank you so much for providing this gem. It is awesome to have such an easy API to work with in Ruby.

I noticed that there are no incremental learning/online versions of the algorithms. For example, the SGDClassifier in sklearn does support partial_fit. Are you planning on implementing something like this?
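For readers unfamiliar with the idea, the sketch below shows what a partial_fit style interface looks like on a simple perceptron. It is a pure-Ruby illustration under stated assumptions, not a Rumale estimator: the class name OnlinePerceptron is hypothetical, and labels are assumed to be +1 or -1.

```ruby
# Hypothetical class, not part of Rumale: a perceptron that can be
# updated batch by batch instead of being trained on the full dataset.
class OnlinePerceptron
  attr_reader :weights, :bias

  def initialize(n_features)
    @weights = Array.new(n_features, 0.0)
    @bias = 0.0
  end

  # Update the model with one mini-batch; labels must be +1 or -1.
  def partial_fit(samples, labels)
    samples.zip(labels).each do |x, y|
      next if predict_one(x) == y # correctly classified, no update

      @weights = @weights.zip(x).map { |w, v| w + y * v }
      @bias += y
    end
    self
  end

  def predict(samples)
    samples.map { |x| predict_one(x) }
  end

  private

  def predict_one(x)
    score = @weights.zip(x).sum { |w, v| w * v } + @bias
    score >= 0 ? 1 : -1
  end
end

model = OnlinePerceptron.new(1)
model.partial_fit([[2.0], [3.0]], [1, 1])     # first chunk of data
model.partial_fit([[-1.0], [-2.0]], [-1, -1]) # second chunk, arriving later
p model.predict([[2.5], [-1.5]])
```

The model keeps its weights between calls, so data can be streamed in chunks that never need to be held in memory at the same time.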

The result of Rumale::LinearModel::LinearRegression with the default solver does not match my intuition

When I ran the following code, the output of predict did not match what I expected.

require 'rumale'
require 'numo/narray'
require 'numo/gnuplot'
require 'numo/linalg/autoloader'

include Numo

x = DFloat[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]].transpose
y = DFloat[[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]].transpose

e = Rumale::LinearModel::LinearRegression.new(fit_bias: true)
e.fit(x, y)

y_ = e.predict(x)

Numo.gnuplot do
  set term: :svg
  set out: 'out.svg'
  set key: :bottom
  plot [x, y, pt: 6, t: 'Data'],
       [x, y_, w: :lines, lw: 3, t: 'Regression']
end

Rendered graph: (image not included)

When I set the solver to svd, the result was as expected.

e = Rumale::LinearModel::LinearRegression.new(fit_bias: true, solver: 'svd')

Rendered graph: (image not included)

Is some additional configuration required when using the default solver?

teratail: Cannot install rumale on Ruby

https://teratail.com/questions/359674

It does seem that the above error occurs on Windows.

Documentation

Hey @yoshoku, this library is great! Ruby could really use a comprehensive ML library.

It looks like the source code is well-commented, but I can't find online documentation for it. I think it would really help users if there was documentation similar to Scikit-learn (linear regression example). I had no idea how much you could do with it until diving into the source code.

OneHotEncoder blows up for large values

Here's a quick demonstration, adapted from the spec:

  it 'does not murder us' do
    x = Numo::Int32[[0, 0, 5999999], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
    y = Numo::Int32[[0, 1, 1]]
    expect(encoder.fit(x).transform(y)).to eq(Numo::Int32[[1, 0, 0, 1, 0, 0, 1, 0, 0]])
  end

This returns a vector of length 60_000_000 or so. My system actually shows 1.5 TB(!) of RAM being consumed, although that is luckily almost completely virtual. It is, however, impossible to work with such values.

I was expecting the transformation to interpret the given values as categorical, as is mentioned in the docs and is the usual practice as far as I can tell.
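The behaviour the report expects can be sketched in pure Ruby: treat each column's distinct values as categories, so the output width depends on how many distinct values were seen rather than on how large they are. This is an illustration only, not Rumale's OneHotEncoder; the class name CategoricalOneHot is an assumption for this example.

```ruby
# Hypothetical encoder, not Rumale's OneHotEncoder: each column's distinct
# values become categories, so large values do not inflate the output.
class CategoricalOneHot
  def fit(rows)
    # One lookup table per column: value => position within the column block.
    @tables = rows.transpose.map do |column|
      column.uniq.sort.each_with_index.to_h
    end
    self
  end

  def transform(rows)
    rows.map do |row|
      row.each_with_index.flat_map do |value, col|
        table = @tables[col]
        raise ArgumentError, "unseen value #{value} in column #{col}" unless table.key?(value)

        Array.new(table.size) { |pos| pos == table[value] ? 1 : 0 }
      end
    end
  end
end

x = [[0, 0, 5_999_999], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoder = CategoricalOneHot.new.fit(x)
encoded = encoder.transform([[0, 1, 1]])
p encoded
```

For the data from the spec above, the three columns have 2, 3 and 4 distinct values, so the encoded row has 9 elements and equals [1, 0, 0, 1, 0, 0, 1, 0, 0], the vector the spec expects.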

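I want to speed up label prediction and machine learning

What I want to solve is as follows.
"I want to make machine learning and label prediction run as fast as possible."

I want to speed up label prediction.
I want to speed up machine learning.

Sample program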

https://github.com/rabbix/machine_learning_sample_rb

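File layout

.
├── data
│   └── sample.txt (sample data)
├── model
│   └── sample_model.dat (trained model generated after training)
├── sample_gen_model.rb (program that builds the trained model)
└── sample_predict.rb (program for label prediction)

Sample data

The labels in the sample data are as follows.

0: sports, 1: weather, 2: science

Currently, the prediction program defines the text to predict as a single string, as shown below.

  # Text to predict
  text = "Sports physical activities with competitive or recreational elements"

What I want to solve

When the single sentence (a string) is replaced by multiple sentences (an array of strings), I would like label prediction to run about as fast as it does for a single sentence.
The expected code inside main and its output look like the following.
The expected array length is about 400.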

Expected code and output

<Expected code (sample_predict.rb)>

# main
if __FILE__ == $0

  # Load the trained model
  model = Marshal.load(File.binread("./model/sample_model.dat"))

  text = [
    "Sports physical activities with competitive or recreational elements",
    "Athletic endeavors for exercise, fun, or competition",
    "Sporting activities that promote physical fitness and skill",
    "Physical games or contests for entertainment purposes",
    "Weather atmospheric conditions at a specific location",
    "Climate patterns that affect daily conditions",
    "Temperature, precipitation, wind and other meteorological factors",
    "Science: systematic study of the natural world",
    "Observation, experimentation, and analysis of phenomena",
    "Evidence-based exploration of the physical universe"
  ]

  # Normalization (??)
  normalizer = Rumale::Preprocessing::L2Normalizer.new
  new_samples = normalizer.fit_transform(get_predict_featurevector(text))

  # Prediction
  puts model.predict(new_samples).to_a
end

<Expected output (sample_predict.rb)>

$ ruby sample_predict.rb
0
0
0
0
1
1
1
2
2
2
time for label prediction: 0.268448s    // I want execution speed equal to the single-sentence case

If there is a solution, I would appreciate your advice.
Thank you in advance.

Method similar to train_test_split of sklearn

@yoshoku - Thank you for this awesome gem.

I tried to convert a simple linear regression example from Python to Ruby. Here are my Python example and Ruby example.

In sklearn, train_test_split is an easy way to split a dataset into training and test sets.

import pandas as pd

# Import Dataset
dataset = pd.read_csv('salary_data.csv')
X = dataset.iloc[:, :-1].values # Take all rows and columns except last one
y = dataset.iloc[:, 1].values # Take all rows of column with index 1

# Split dataset into Training and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

However, I couldn't find something similar in rumale. So I implemented it like this:

require 'daru'
require 'rumale'

# Import Dataset
df = Daru::DataFrame.from_csv('salary_data.csv')

# Convert dataset to Numo::NArray dataset
data = Numo::DFloat.cast df['YearsExperience', 'Salary'].to_a[0].map { |row| row.values }

# Split dataset into Training and Test set
x_train, y_train, x_test, y_test = nil
Rumale::ModelSelection::ShuffleSplit.new(test_size: 0.3, n_splits: 1, random_seed: 1).split(data).each do |train, test|
  x_train, y_train = data[train, true][true, 0..-2], data[train, true][true, 1..-1]
  x_test, y_test = data[test, true][true, 0..-2], data[test, true][true, 1..-1]
end

If I am not missing an easy way provided within rumale to split dataset, I think it would be great to have a method similar to train_test_split in rumale.
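The kind of helper being requested could look roughly like the following. This is a pure-Ruby sketch on plain Arrays, not a Rumale method: the name train_test_split and its keywords are borrowed from sklearn for illustration and are assumptions here.

```ruby
# Hypothetical helper, not part of Rumale: shuffle row indices with a
# seeded generator and cut them into training and test parts.
def train_test_split(x, y, test_size: 0.3, random_seed: 1)
  raise ArgumentError, 'x and y must have the same length' unless x.size == y.size

  indices = (0...x.size).to_a.shuffle(random: Random.new(random_seed))
  n_test = (x.size * test_size).round
  test_ids = indices.first(n_test)
  train_ids = indices.drop(n_test)
  # Return order follows sklearn: x_train, x_test, y_train, y_test.
  [x.values_at(*train_ids), x.values_at(*test_ids),
   y.values_at(*train_ids), y.values_at(*test_ids)]
end

x = (1..10).map { |v| [v.to_f] }
y = (1..10).map { |v| v * 1000.0 }
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size: 0.3)
puts "train: #{x_train.size} rows, test: #{x_test.size} rows"
```

The same seed always produces the same split, and each sample stays paired with its target because both arrays are indexed with the same shuffled indices.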

Typo in project description

Just a minor nitpick...

Rumale is a machine learninig library in Ruby should probably be learning instead of learninig.

Thanks for creating and sharing this project.

Cross validation: best score

At present, cross validation allows only the best MSE to be used to evaluate a model. It may be good to:

Output an array with MSEs for all parameter combinations, which users can examine to choose what they consider to be the best parameters; see for example https://github.com/escape2020/school2021/blob/main/machine-learning-2/Validation%20and%20Optimization.ipynb
Offer the option to choose a model that balances the best MSE against the smallest number of model parameters; a metric that also values sparsity, such as Tikhonov regularization, may be a good thing to add.

These could also be presented as examples for people to adapt, rather than as additions to the core library since they can then be adapted to specific situations.
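As a rough illustration of the first suggestion, the sketch below evaluates every combination in a parameter grid and keeps all scores rather than only the best one. It is pure Ruby with a stand-in scoring block; the helper name grid_scores and the parameter names in the grid are assumptions for this example, and a real scoring block would run cross-validation and return the mean squared error.

```ruby
# Hypothetical helper: evaluate every combination in a parameter grid and
# keep all scores, not only the best one.
def grid_scores(grid)
  keys = grid.keys
  first, *rest = grid.values
  first.product(*rest).map do |values|
    params = keys.zip(values).to_h
    { params: params, mse: yield(params) }
  end
end

grid = { reg_param: [0.01, 0.1, 1.0], max_iter: [100, 200] }

# Stand-in scoring block; in practice this would run cross-validation
# and return the mean squared error for the given parameters.
results = grid_scores(grid) do |params|
  (params[:reg_param] - 0.1)**2 + 1.0 / params[:max_iter]
end

results.sort_by { |r| r[:mse] }.each do |r|
  puts format('%-45s mse=%.4f', r[:params].inspect, r[:mse])
end
```

Because the full list is returned, users can apply their own selection rule afterwards, for example preferring a simpler model whose MSE is within some tolerance of the best one.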

fit_bias: false is the default parameter for LinearRegression. Too difficult to discover?

Hello.

I have a question and a suggestion.
Would you consider changing the default value of fit_bias to true?

In the Rumale::LinearModel::LinearRegression class, the default value of fit_bias is false.

In my understanding, this means that the regression line will always go through (0, 0) unless you set fit_bias: true.

This is difficult for beginners, I think.

If you're new to Rumale, you won't understand why the regression line doesn't fit. You can't tell whether Rumale is buggy or you are missing an option until you search the documentation. Some people may quit before looking at the documentation.

fit_intercept=True is the default value in sklearn.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
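The effect of omitting the bias term can be checked by hand with closed-form least squares. The sketch below is pure Ruby and independent of Rumale's solvers; it uses the data from the LinearRegression report earlier on this page (x = 1..10, y = 10..1).

```ruby
# Closed-form simple linear regression, with and without a bias term.
x = (1..10).map(&:to_f)
y = x.map { |v| 11.0 - v }

# Without a bias term the line is forced through the origin:
# slope = sum(x * y) / sum(x * x).
slope_origin = x.zip(y).sum { |a, b| a * b } / x.sum { |a| a * a }

# With a bias term:
# slope = cov(x, y) / var(x), intercept = mean_y - slope * mean_x.
mean_x = x.sum / x.size
mean_y = y.sum / y.size
slope = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) } / x.sum { |a| (a - mean_x)**2 }
intercept = mean_y - slope * mean_x

puts format('no bias:   y = %.3f * x', slope_origin)
puts format('with bias: y = %.3f * x + %.3f', slope, intercept)
```

Without the bias term the best line through the origin has a positive slope of 220/385 (about 0.571) even though the data is strictly decreasing, while the fit with a bias term recovers y = -x + 11 exactly.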

We need a logo for the project

I think it would be nice if this project had a fashionable logo.

Feature request: UMAP

HDBSCAN

Is it worth adding HDBSCAN?

Example 1. XOR data - fails

I was testing rumale examples from https://yoshoku.github.io/rumale/doc/

The first example failed with this error - undefined local variable or method `x' for main:Object (NameError)

I replaced x with features and y with labels and it worked.

Is there a way to show the classification report as in sklearn?

Thank you for your awesome project.

In scikit-learn, the `classification_report' function shows classification metrics on a per-class basis.
I cannot find an equivalent method in Rumale. Is there a simple way to obtain a similar report?
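The per-class metrics in such a report are straightforward to compute from the true and predicted labels. The sketch below is pure Ruby on plain Arrays and is not a Rumale method; the helper name classification_report is borrowed from scikit-learn for illustration.

```ruby
# Hypothetical helper, not part of Rumale: per-class precision, recall,
# F1 score and support, keyed by label.
def classification_report(y_true, y_pred)
  (y_true | y_pred).sort.to_h do |label|
    tp = y_true.zip(y_pred).count { |t, pr| t == label && pr == label }
    predicted = y_pred.count(label)
    actual = y_true.count(label)
    precision = predicted.zero? ? 0.0 : tp.to_f / predicted
    recall = actual.zero? ? 0.0 : tp.to_f / actual
    f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
    [label, { precision: precision, recall: recall, f1: f1, support: actual }]
  end
end

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = classification_report(y_true, y_pred)
report.each do |label, m|
  puts format('%5s  precision=%.2f  recall=%.2f  f1=%.2f  support=%d',
              label, m[:precision], m[:recall], m[:f1], m[:support])
end
```

With NArray inputs, calling to_a on the label vectors first would let the same helper be used on the output of an estimator's predict method.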

feature request: SNN clustering

I'd like to add a feature request for SNN Clustering (Shared Nearest Neighbors) and possibly also HDBSCAN.

The advantage of SNN over DBSCAN is that it can identify clusters with different densities, and it does not need to be given a fixed number of clusters as k-means does. Its way of identifying similarity between items also has advantages over Euclidean distance when working with a higher number of dimensions.

HDBSCAN is an interesting improvement over DBSCAN, as it only requires one hyperparameter, and reports a probability of assignment to a cluster (which can be used to optimize the minPts hyperparameter). Towards Data Science: How to cluster in High Dimensions has an interesting overview including possible advantages of HDBSCAN and SNN (in the variant SNN-cliq).

As far as I know, neither algorithm has a Ruby implementation yet.

PS: thanks for all your work on Rumale so far, it's greatly appreciated.

Required Ruby Version is unclear

I tried installing with Ruby 2.3, but I ran into this error while building rumale-tree-0.27:

extconf.rb:26:in `<main>': undefined method `match?' for "aarch64-linux":String (NoMethodError)

That suggests Ruby 2.4 is the minimum version.

https://rubygems.org/gems/rumale doesn't provide much information either. Can you help clarify which version of Ruby you are targeting?

Thank you!

Compatibility Inquiry: rumale 0.28.1 with Ruby 3.3.0

Hi,

I'm seeking confirmation regarding the compatibility of rumale 0.28.1 with the recently released Ruby 3.3.0 (December 25, 2023).

I reviewed the workflow file at https://github.com/yoshoku/rumale/blob/main/.github/workflows/main.yml, which lists ruby: [ '3.0', '3.1', '3.2', '3.3' ]. Does this mean it is compatible with 3.3.0?

Given your expertise in this area, could you offer any insights or leads on this matter? Any information you can provide would be greatly appreciated.

Thank you for your time and assistance!

How to use your own datasets (labels & samples) from a Ruby Array?

Samples and labels are not Ruby Arrays but NArrays.

Samples are a Numo::DFloat object.
Labels are a Numo::Int32 object.

You can create an NArray from a Ruby Array.

samples = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2]]
samples = Numo::DFloat.cast(samples)
# samples = Numo::DFloat[*samples]

labels = [0,0,1]
labels = Numo::Int32.cast(labels)
# labels = Numo::Int32[*labels]

ruby-numo/numo-narray#132

Add option to store training summary

I've noticed that people complain about scikit-learn lacking the option to generate a summary. Is this feature already present in rumale? If not, can we look into that? I would love to contribute, but would need some pointers to get started.

stack level too deep

I am using Rumale::Ensemble::GradientBoostingRegressor.new with a 250MB csv file as training data, and I am getting

rumale/tree/gradient_tree_regressor.rb:117:in `apply': stack level too deep (SystemStackError)

during the fit

Bundler 2.2.11
Platforms ruby, x86_64-linux

I tried to ulimit -s unlimited,

> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127649
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

but still no luck.

Thanks.

Feature request: t-sne

New algorithm request
I want t-sne rather than neural networks.

I want to visualize the open-access cancer mRNA data from the TCGA (The Cancer Genome Atlas).

Add a social media preview

Hey!

I'm following your work on Rumale and would suggest adding the logo to the social media preview, which is newly available for project repositories. This ensures that links to Rumale's repository show the logo and not your avatar on GitHub.

To do so please go to "Settings->Social media preview" in this repository.

Thank you again for your work on this topic, it's very useful!
