yoshoku / rumale
Rumale is a machine learning library in Ruby
Home Page: https://rubygems.org/gems/rumale
License: BSD 3-Clause "New" or "Revised" License
I am using Rumale::Ensemble::GradientBoostingRegressor.new with a 250 MB CSV file as training data, and during the fit I get:
rumale/tree/gradient_tree_regressor.rb:117:in `apply': stack level too deep (SystemStackError)
Bundler 2.2.11
Platforms ruby, x86_64-linux
I tried ulimit -s unlimited:
> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127649
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 65535
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
but still no luck.
Thanks.
Samples and labels must be Numo::NArray objects, not Ruby Arrays.
You can create an NArray from a Ruby Array:
samples = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2]]
samples = Numo::DFloat.cast(samples)
# samples = Numo::DFloat[*samples]
labels = [0,0,1]
labels = Numo::Int32.cast(labels)
# labels = Numo::Int32[*labels]
I've noticed that people complain about scikit-learn lacking an option to generate a summary. Is this feature already present in Rumale? If not, can we look into it? I would love to contribute, but I would need some pointers to get started.
Is it worth adding HDBSCAN?
Thank you for your awesome project.
In scikit-learn, the `classification_report` function shows the classification metrics on a per-class basis.
I cannot find an equivalent method in Rumale. Is there a simple way to obtain a similar report?
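For illustration, a per-class report can be computed by hand from the true and predicted labels. The sketch below is plain Ruby on ordinary arrays; `classification_report` is a hypothetical helper written for this example and does not use Rumale's API.

```ruby
# Per-class precision, recall, F1, and support from two label arrays.
# `classification_report` is a hypothetical helper, written only for illustration.
def classification_report(y_true, y_pred)
  pairs = y_true.zip(y_pred)
  classes = (y_true + y_pred).uniq.sort
  classes.to_h do |c|
    tp = pairs.count { |t, pr| t == c && pr == c } # true positives
    fp = pairs.count { |t, pr| t != c && pr == c } # false positives
    fn = pairs.count { |t, pr| t == c && pr != c } # false negatives
    precision = (tp + fp).zero? ? 0.0 : tp.to_f / (tp + fp)
    recall = (tp + fn).zero? ? 0.0 : tp.to_f / (tp + fn)
    f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
    [c, { precision: precision, recall: recall, f1: f1, support: tp + fn }]
  end
end

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
classification_report(y_true, y_pred).each do |label, m|
  puts format('%5s  precision=%.2f  recall=%.2f  f1=%.2f  support=%d',
              label, m[:precision], m[:recall], m[:f1], m[:support])
end
```

With Numo arrays, calling `to_a` on the label vectors first should make them usable with this helper.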
I was testing rumale examples from https://yoshoku.github.io/rumale/doc/
The first example failed with this error - undefined local variable or method `x' for main:Object (NameError)
I replaced x with features and y with labels and it worked.
First, thank you so much for providing this gem. It is awesome to have such an easy API to work with in Ruby.
I noticed that there are no incremental learning/online versions of the algorithms. For example, the SGDClassifier in sklearn supports partial_fit. Are you planning on implementing something like this?
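To make the request concrete, here is a minimal sketch of what a `partial_fit`-style interface could look like. It is plain Ruby, uses a simple perceptron update rather than a general SGD classifier, and the `OnlinePerceptron` class is hypothetical; nothing here is part of Rumale.

```ruby
# Hypothetical online linear classifier with a partial_fit method.
# Uses the perceptron update rule; labels must be -1 or +1.
class OnlinePerceptron
  attr_reader :weights, :bias

  def initialize(n_features, learning_rate: 0.1)
    @weights = Array.new(n_features, 0.0)
    @bias = 0.0
    @learning_rate = learning_rate
  end

  # Update the model with one mini-batch, keeping the state from earlier batches.
  def partial_fit(samples, labels)
    samples.zip(labels).each do |x, y|
      next if y * decision(x) > 0 # already on the correct side: no update

      @weights = @weights.zip(x).map { |w, xi| w + @learning_rate * y * xi }
      @bias += @learning_rate * y
    end
    self
  end

  def predict(samples)
    samples.map { |x| decision(x) >= 0 ? 1 : -1 }
  end

  private

  def decision(x)
    @weights.zip(x).sum { |w, xi| w * xi } + @bias
  end
end

model = OnlinePerceptron.new(2)
batches = [
  [[[2.0, 1.0], [-1.0, -2.0]], [1, -1]],
  [[[1.5, 0.5], [-2.0, -1.0]], [1, -1]]
]
# Feed the batches one at a time, as they would arrive from a stream.
3.times { batches.each { |x, y| model.partial_fit(x, y) } }
p model.predict([[3.0, 1.0], [-3.0, -1.0]]) # prints [1, -1]
```

The key design point is that the learned state lives in the object, so each call only needs the current batch in memory.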
I'd like to add a feature request for SNN Clustering (Shared Nearest Neighbors) and possibly also HDBSCAN.
The advantage of SNN over DBSCAN is that it can identify clusters with different densities, and unlike k-means it does not need to be given a fixed number of clusters. Its way of measuring similarity between items also has advantages over Euclidean distance when working with a higher number of dimensions.
HDBSCAN is an interesting improvement over DBSCAN, as it only requires one hyperparameter, and reports a probability of assignment to a cluster (which can be used to optimize the minPts hyperparameter). Towards Data Science: How to cluster in High Dimensions has an interesting overview including possible advantages of HDBSCAN and SNN (in the variant SNN-cliq).
As far as I know, neither algorithm has a Ruby implementation yet.
PS: thanks for all your work on Rumale so far, it's greatly appreciated.
When I ran the following code, the output of predict did not match what I intuitively expected.
require 'rumale'
require 'numo/narray'
require 'numo/gnuplot'
require 'numo/linalg/autoloader'
include Numo
x = DFloat[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]].transpose
y = DFloat[[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]].transpose
e = Rumale::LinearModel::LinearRegression.new(fit_bias: true)
e.fit(x, y)
y_ = e.predict(x)
Numo.gnuplot do
  set term: :svg
  set out: 'out.svg'
  set key: :bottom
  plot [x, y, pt: 6, t: 'Data'],
       [x, y_, w: :lines, lw: 3, t: 'Regression']
end
Note that when I set the solver to svd, the result was as expected.
e = Rumale::LinearModel::LinearRegression.new(fit_bias: true, solver: 'svd')
Is some additional configuration needed when using the default solver?
Hey!
I'm following your work on Rumale and suggest adding the logo to the social media preview, which is newly available for project repositories. This ensures that links to Rumale's repository show the logo rather than your avatar on GitHub.
To do so, please go to "Settings -> Social media preview" in this repository.
Thank you again for your work on this topic, it's very useful!
At present, cross validation uses only the best MSE to evaluate a model. It may be good to:
- Output an array with the MSEs for all parameter combinations, which users can examine to choose what they consider to be the best parameters; see for example https://github.com/escape2020/school2021/blob/main/machine-learning-2/Validation%20and%20Optimization.ipynb
- Offer the option to choose a model that balances the best MSE against the smallest number of model parameters, perhaps using a criterion that also penalizes model complexity, such as a Tikhonov regularization term.
These could also be presented as examples for people to adapt, rather than as additions to the core library, since they can then be tailored to specific situations.
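As a rough illustration of the first suggestion, the sketch below runs k-fold cross validation over several regularization strengths and returns the mean validation MSE for every candidate instead of only the best one. It is plain Ruby on a one-feature ridge regression without intercept; `fit_ridge`, `mse`, and `cv_scores` are hypothetical helpers and do not reflect how Rumale implements cross validation.

```ruby
# One-feature ridge regression without intercept:
#   w = sum(x * y) / (sum(x * x) + lambda)
def fit_ridge(xs, ys, lambda_)
  xs.zip(ys).sum { |x, y| x * y } / (xs.sum { |x| x * x } + lambda_)
end

def mse(xs, ys, w)
  xs.zip(ys).sum { |x, y| (y - w * x)**2 } / xs.size
end

# Returns { lambda => mean validation MSE } for every candidate value.
def cv_scores(xs, ys, lambdas, n_splits: 5)
  folds = (0...xs.size).group_by { |i| i % n_splits }.values
  lambdas.to_h do |lam|
    fold_mses = folds.map do |test_idx|
      train_idx = (0...xs.size).to_a - test_idx
      w = fit_ridge(xs.values_at(*train_idx), ys.values_at(*train_idx), lam)
      mse(xs.values_at(*test_idx), ys.values_at(*test_idx), w)
    end
    [lam, fold_mses.sum / fold_mses.size]
  end
end

# Synthetic data: y = 2x plus a small alternating offset.
xs = (1..20).map(&:to_f)
ys = xs.map.with_index { |x, i| 2.0 * x + (i.even? ? 0.5 : -0.5) }

scores = cv_scores(xs, ys, [0.0, 1.0, 10.0, 100.0])
scores.each { |lam, err| puts format('lambda=%6.1f  mean MSE=%.4f', lam, err) }
best_lambda = scores.min_by { |_, err| err }.first
puts "lowest mean MSE at lambda=#{best_lambda}"
```

Because the full table of scores is returned, a user can apply any selection rule afterwards, for example preferring a stronger penalty when its MSE is within some tolerance of the best.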
https://teratail.com/questions/359674
On Windows, the above error does indeed seem to occur.
Here's a quick demonstration, adapted from the spec:
it 'does not murder us' do
  x = Numo::Int32[[0, 0, 5999999], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
  y = Numo::Int32[[0, 1, 1]]
  expect(encoder.fit(x).transform(y)).to eq(Numo::Int32[[1, 0, 0, 1, 0, 0, 1, 0, 0]])
end
This returns a vector of length 60_000_000 or so. My system actually shows 1.5 TB(!) of RAM being consumed, although luckily that is almost completely virtual. It is, however, impossible to work with such values.
I was expecting the transformation to interpret the given values as categorical, as mentioned in the docs and as is the usual practice as far as I can tell.
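The behavior expected here can be sketched in plain Ruby: each column keeps its own map from observed values to indices, so the output width depends on the number of distinct values rather than on the largest value. `CategoricalOneHot` is a hypothetical class for illustration, not Rumale's encoder.

```ruby
# Hypothetical one-hot encoder that treats each column's values as categories.
class CategoricalOneHot
  def fit(rows)
    # One sorted list of distinct values per column.
    @categories = rows.transpose.map { |col| col.uniq.sort }
    self
  end

  def transform(rows)
    rows.map do |row|
      row.each_with_index.flat_map do |value, j|
        cats = @categories[j]
        idx = cats.index(value)
        raise ArgumentError, "unseen value #{value} in column #{j}" if idx.nil?

        Array.new(cats.size) { |k| k == idx ? 1 : 0 }
      end
    end
  end
end

x = [[0, 0, 5_999_999], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoded = CategoricalOneHot.new.fit(x).transform([[0, 1, 1]])
p encoded # prints [[1, 0, 0, 1, 0, 0, 1, 0, 0]]
```

The three columns have 2, 3, and 4 distinct values, so the encoded row has 9 entries no matter how large the raw values are.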
I tried installing with Ruby 2.3, but I ran into this error while building rumale-tree-0.27:
extconf.rb:26:in `<main>': undefined method `match?' for "aarch64-linux":String (NoMethodError)
That suggests Ruby 2.4 is the minimum version.
https://rubygems.org/gems/rumale doesn't provide much information either. Can you help clarify which version of Ruby you are targeting?
Thank you!
Hi,
In the libsvm MNIST file, zero values are skipped, so the number of features inferred from the file is not correct.
My suggestion is to add an option to specify the number of features when loading a libsvm file.
2.7.1 :004 > x, y = Rumale::Dataset.load_libsvm_file("mnist", zero_based: true)
2.7.1 :005 >
2.7.1 :006 > y
=>
Numo::Int32#shape=[60000]
[5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4, 0, 9, 1, 1, ...]
2.7.1 :007 > x
=>
Numo::DFloat#shape=[60000,781]
My desired size of x is [60000,784].
If anyone knows how I can do this, I'd be happy to hear it.
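The suggestion can be sketched as a small parser that takes the feature count explicitly, so trailing features that are zero in every sample still get a column. This is plain Ruby; `parse_libsvm` and its `n_features` keyword are hypothetical names used for illustration, and the result is nested Ruby arrays rather than Numo arrays.

```ruby
# Hypothetical libsvm parser with an explicit feature count.
# Each line looks like: "<label> <index>:<value> <index>:<value> ..."
def parse_libsvm(lines, n_features:, zero_based: false)
  labels = []
  samples = lines.map do |line|
    label, *pairs = line.split
    labels << Integer(label, 10)
    row = Array.new(n_features, 0.0) # omitted entries stay zero
    pairs.each do |pair|
      index, value = pair.split(':')
      col = Integer(index, 10) - (zero_based ? 0 : 1)
      unless (0...n_features).cover?(col)
        raise IndexError, "feature index #{index} is outside n_features=#{n_features}"
      end

      row[col] = Float(value)
    end
    row
  end
  [samples, labels]
end

lines = ['5 1:0.5 3:1.0', '0 2:0.25']
x, y = parse_libsvm(lines, n_features: 6)
p x.map(&:size) # prints [6, 6]
p y             # prints [5, 0]
```

For MNIST the call would pass `n_features: 784`; the nested arrays could then be converted with `Numo::DFloat.cast`.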
Hey @yoshoku, this library is great! Ruby could really use a comprehensive ML library.
It looks like the source code is well-commented, but I can't find online documentation for it. I think it would really help users if there was documentation similar to Scikit-learn (linear regression example). I had no idea how much you could do with it until diving into the source code.
Can I try adding FP-Growth, an algorithm for association rule mining, to this repo?
@yoshoku - Thank you for this awesome gem.
I tried to convert a simple linear regression example from Python to Ruby. Here is my Python example and Ruby example.
In sklearn, train_test_split is an easy function for splitting a dataset into training and test sets.
# Import Dataset
dataset = pd.read_csv('salary_data.csv')
X = dataset.iloc[:, :-1].values # Take all rows and columns except last one
y = dataset.iloc[:, 1].values # Take all rows of column with index 1
# Split dataset into Training and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
However, I couldn't find something similar in rumale. So I implemented like this:
require 'daru'
require 'numo/narray'
require 'rumale'

# Import Dataset
df = Daru::DataFrame.from_csv('salary_data.csv')
# Convert dataset to Numo::NArray dataset
data = Numo::DFloat.cast(df['YearsExperience', 'Salary'].to_a[0].map { |row| row.values })
# Split dataset into Training and Test set
x_train = y_train = x_test = y_test = nil
splitter = Rumale::ModelSelection::ShuffleSplit.new(test_size: 0.3, n_splits: 1, random_seed: 1)
splitter.split(data).each do |train, test|
  x_train, y_train = data[train, true][true, 0..-2], data[train, true][true, 1..-1]
  x_test, y_test = data[test, true][true, 0..-2], data[test, true][true, 1..-1]
end
Unless I am missing an easy way within Rumale to split a dataset, I think it would be great to have a method similar to train_test_split in Rumale.
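A helper along the requested lines could look roughly like the following. It is a plain Ruby sketch on ordinary arrays; the `train_test_split` function and its keywords are hypothetical, modeled on the sklearn call quoted above, and not taken from Rumale.

```ruby
# Hypothetical train/test split for plain Ruby arrays.
# Returns [x_train, x_test, y_train, y_test], mirroring the sklearn ordering.
def train_test_split(x, y, test_size: 0.3, random_seed: nil)
  raise ArgumentError, 'x and y must have the same length' unless x.size == y.size

  rng = random_seed ? Random.new(random_seed) : Random.new
  indices = (0...x.size).to_a.shuffle(random: rng)
  n_test = (x.size * test_size).round
  test_idx = indices.first(n_test)
  train_idx = indices.drop(n_test)
  [x.values_at(*train_idx), x.values_at(*test_idx),
   y.values_at(*train_idx), y.values_at(*test_idx)]
end

x = (1..10).map { |v| [v.to_f] }
y = (1..10).map { |v| 1000.0 * v }
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size: 0.3, random_seed: 0)
puts "train=#{x_train.size} test=#{x_test.size}" # prints train=7 test=3
```

Shuffling a single index list and applying it to both `x` and `y` keeps each sample paired with its label.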
New algorithm request
I want t-sne rather than neural networks.
Running gem install rumale fails with an error:
Building native extensions. This could take a while...
ERROR: Error installing rumale:
ERROR: Failed to build gem native extension.
current directory: /Users/kietdv/.rvm/gems/jruby-9.2.0.0/gems/numo-narray-0.9.1.6/ext/numo/narray
/Users/kietdv/.rvm/rubies/jruby-9.2.0.0/bin/jruby -r ./siteconf20200222-38612-w78zw1.rb extconf.rb
checking for stdbool.h... RuntimeError: The compiler failed to generate an executable file.
You have to install development tools first.
try_do at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:456
try_cpp at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:587
block in have_header at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:1091
block in checking_for at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:942
block in postpone at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:350
open at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:320
block in postpone at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:350
open at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:320
postpone at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:346
checking_for at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:941
have_header at /Users/kietdv/.rvm/rubies/jruby-9.2.0.0/lib/ruby/stdlib/mkmf.rb:1090
<main> at extconf.rb:60
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers. Check the mkmf.log file for more details. You may
need configuration options.
Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=/Users/kietdv/.rvm/rubies/jruby-9.2.0.0/bin/jruby
To see why this extension failed to compile, please check the mkmf.log which can be found here:
/Users/kietdv/.rvm/gems/jruby-9.2.0.0/extensions/universal-java-1.8/2.5.0/numo-narray-0.9.1.6/mkmf.log
extconf failed, exit code 1
Gem files will remain installed in /Users/kietdv/.rvm/gems/jruby-9.2.0.0/gems/numo-narray-0.9.1.6 for inspection.
Results logged to /Users/kietdv/.rvm/gems/jruby-9.2.0.0/extensions/universal-java-1.8/2.5.0/numo-narray-0.9.1.6/gem_make.out
Hello.
I have a question and a suggestion.
Would you consider changing the default value of fit_bias to true?
In the Rumale::LinearModel::LinearRegression class, the default value of fit_bias is false.
In my understanding, this means that the regression line will always go through (0, 0) unless you set fit_bias: true.
I think this is difficult for beginners.
If you're new to Rumale, you won't understand why the regression line doesn't fit. You can't tell whether Rumale is buggy or you are missing an option until you search the documentation. Some people may quit before looking for documentation.
fit_intercept=True is the default value in sklearn.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
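The effect of the intercept can be shown with closed-form least squares on one feature. This is a plain Ruby sketch that does not call Rumale; `fit_line` is a hypothetical helper whose `fit_bias` keyword only mirrors the option name discussed above.

```ruby
# Closed-form least squares for a single feature.
# Returns [slope, intercept]; without a bias term the line is forced through (0, 0).
def fit_line(xs, ys, fit_bias:)
  if fit_bias
    mx = xs.sum / xs.size
    my = ys.sum / ys.size
    slope = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } / xs.sum { |x| (x - mx)**2 }
    [slope, my - slope * mx]
  else
    [xs.zip(ys).sum { |x, y| x * y } / xs.sum { |x| x * x }, 0.0]
  end
end

xs = (1..10).map(&:to_f)
ys = xs.map { |x| 11.0 - x } # decreasing data: 10, 9, ..., 1

p fit_line(xs, ys, fit_bias: true)  # prints [-1.0, 11.0]
p fit_line(xs, ys, fit_bias: false) # slope is 220/385, about 0.571, intercept 0.0
```

On this decreasing data the no-intercept fit even has a positive slope, which is the kind of result a newcomer could easily mistake for a bug.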
I think it would be nice if this project has a fashionable logo.
Hello,
First, congratulations on this wonderful project!
I'm using some of your clustering algorithms and was wondering whether you plan to add some basic metrics such as the Silhouette Coefficient, the Calinski-Harabasz Index, and the Davies-Bouldin Index. According to https://scikit-learn.org/stable/modules/clustering.html, they help in choosing the number of clusters when that number is not known in advance.
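As an example of one of these metrics, the mean Silhouette Coefficient can be sketched in plain Ruby. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The `silhouette_score` helper below is hypothetical, assumes at least two clusters, and is O(n^2), so it is only meant for illustration.

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |ai, bi| (ai - bi)**2 })
end

# Mean silhouette coefficient over all points (hypothetical helper).
def silhouette_score(points, labels)
  clusters = labels.uniq
  scores = points.each_with_index.map do |pt, i|
    # Mean distance from this point to each cluster, excluding the point itself.
    mean_dist = clusters.to_h do |c|
      members = points.each_index.select { |j| labels[j] == c && j != i }
      [c, members.empty? ? nil : members.sum { |j| euclidean(pt, points[j]) } / members.size]
    end
    a = mean_dist[labels[i]]
    next 0.0 if a.nil? # singleton cluster: score is taken to be 0

    b = mean_dist.reject { |c, _| c == labels[i] }.values.compact.min
    denom = [a, b].max
    denom.zero? ? 0.0 : (b - a) / denom
  end
  scores.sum / scores.size
end

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
puts silhouette_score(points, [0, 0, 1, 1]) # well separated: close to 1
puts silhouette_score(points, [0, 1, 0, 1]) # poor assignment: negative
```

Running a clustering algorithm for several candidate cluster counts and comparing these scores is one common way to pick the count.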
What I want to solve is the following:
"I want to make machine learning and label prediction run as fast as possible."
https://github.com/rabbix/machine_learning_sample_rb
.
├── data
│   └── sample.txt (sample data)
├── model
│   └── sample_model.dat (model file generated after training)
├── sample_gen_model.rb (program that builds the trained model)
└── sample_predict.rb (program for label prediction)
The labels in the sample data are as follows.
0: sports, 1: weather, 2: science
In the current prediction program, the text to predict is defined as a single string, as shown below.
# Text to predict
text = "Sports physical activities with competitive or recreational elements"
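When the single sentence (a string) is replaced by multiple sentences (an array of strings), I want label prediction to run at about the same speed as for a single sentence.
The expected code inside main and its output look roughly like the following.
The expected array length is about 400.
<Expected code (sample_predict.rb)>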
# main
if __FILE__ == $0
  # Load the trained model
  model = Marshal.load(File.binread("./model/sample_model.dat"))
  text = [
    "Sports physical activities with competitive or recreational elements",
    "Athletic endeavors for exercise, fun, or competition",
    "Sporting activities that promote physical fitness and skill",
    "Physical games or contests for entertainment purposes",
    "Weather atmospheric conditions at a specific location",
    "Climate patterns that affect daily conditions",
    "Temperature, precipitation, wind and other meteorological factors",
    "Science: systematic study of the natural world",
    "Observation, experimentation, and analysis of phenomena",
    "Evidence-based exploration of the physical universe"
  ]
  # Normalization (??)
  normalizer = Rumale::Preprocessing::L2Normalizer.new
  new_samples = normalizer.fit_transform(get_predict_featurevector(text))
  # Prediction
  puts model.predict(new_samples).to_a
end
<Expected output (sample_predict.rb)>
$ ruby sample_predict.rb
0
0
0
0
1
1
1
2
2
2
time for label prediction: 0.268448s // I want speed equivalent to the single-sentence case
If there is a solution, I would appreciate it if you could let me know.
Thank you in advance.
Hi,
I'm seeking confirmation regarding the compatibility of rumale 0.28.1 with the recently released Ruby 3.3.0 (December 25, 2023).
I reviewed the CI workflow at https://github.com/yoshoku/rumale/blob/main/.github/workflows/main.yml, which lists ruby: [ '3.0', '3.1', '3.2', '3.3' ]. Does this mean it is compatible with 3.3.0?
Given your expertise in this area, could you offer any insights or leads on this matter? Any information you can provide would be greatly appreciated.
Thank you for your time and assistance!
Just a minor nitpick...
"Rumale is a machine learninig library in Ruby" should probably say "learning" instead of "learninig".
Thanks for creating and sharing this project.