gyrdym / ml_algo

Machine learning algorithms in Dart programming language

Home Page: https://gyrdym.github.io/ml_algo/

License: BSD 2-Clause "Simplified" License

ml_algo's Introduction

Machine learning algorithms for Dart developers - ml_algo library

The library is a part of the ecosystem:

What is ml_algo for?

The main purpose of the library is to provide native Dart implementations of machine learning algorithms to those who are interested in both the Dart language and data science. The library targets the Dart VM and Flutter; it cannot be used in web applications.

The library content

  • Model selection

    • CrossValidator. A factory that creates instances of cross validators. Cross-validation allows researchers to fit different hyperparameters of machine learning algorithms assessing prediction quality on different parts of a dataset.
  • Classification algorithms

    • LogisticRegressor. A class that performs linear binary classification of data. To use this kind of classifier your data has to be linearly separable.

      • LogisticRegressor.SGD. Implementation of the logistic regression algorithm based on stochastic gradient descent with L2 regularisation. To use this kind of classifier your data has to be linearly separable.

      • LogisticRegressor.BGD. Implementation of the logistic regression algorithm based on batch gradient descent with L2 regularisation. To use this kind of classifier your data has to be linearly separable.

      • LogisticRegressor.newton. Implementation of the logistic regression algorithm based on Newton-Raphson method with L2 regularisation. To use this kind of classifier your data has to be linearly separable.

    • SoftmaxRegressor. A class that performs linear multiclass classification of data. To use this kind of classifier your data has to be linearly separable.

    • DecisionTreeClassifier. A class that performs classification using decision trees. May work with data with non-linear patterns.

    • KnnClassifier. A class that performs classification using the k nearest neighbours algorithm: it makes predictions based on the k closest observations to the given one.

  • Regression algorithms

    • LinearRegressor. A general class for finding a linear pattern in training data and predicting outcomes as real numbers.

      • LinearRegressor.lasso. Implementation of the linear regression algorithm based on coordinate descent with lasso regularisation.

      • LinearRegressor.SGD. Implementation of the linear regression algorithm based on stochastic gradient descent with L2 regularisation.

      • LinearRegressor.BGD. Implementation of the linear regression algorithm based on batch gradient descent with L2 regularisation.

      • LinearRegressor.newton. Implementation of the linear regression algorithm based on Newton-Raphson method with L2 regularisation.

    • KnnRegressor. A class that makes predictions for each new observation based on the k closest observations from training data. It may catch non-linear patterns of the data.

  • Clustering and retrieval algorithms

    • KDTree. An algorithm for efficient data retrieval.
    • Locality sensitive hashing. A family of algorithms that randomly partition all reference data points into different bins, which makes it possible to perform efficient K Nearest Neighbours search, since there is no need to search for the neighbours through the entire data. The family is represented by the following classes:

For more information on the library's API, please visit the API reference

Examples

Logistic regression

Let's classify records from a well-known dataset, the Pima Indians Diabetes Database, using the logistic regressor.

Important note:

Please pay attention to the problems that the classifiers and regressors exposed by the library solve. For example, the logistic regressor solves only binary classification problems, which means that you can't use it with a dataset that has more than two classes. To find out more about the regressors and classifiers, please refer to the API documentation of the package.

Import all necessary packages. First, make sure that you have the ml_preprocessing and ml_dataframe packages in your dependencies:

dependencies:
  ml_dataframe: ^1.5.0
  ml_preprocessing: ^7.0.2

We need these packages to parse raw data in order to use it further. For more details, please visit the ml_preprocessing repository page.

Important note:

Regressors and classifiers exposed by the library do not handle strings, booleans and nulls; they can only deal with numbers! You must convert all non-numeric values of your dataset to numbers. Please refer to the ml_preprocessing library to find out more about data preprocessing.

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

Read a dataset's file

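We have 2 options here: read a downloaded copy of the dataset from a file, or use the toy dataset that ships with the ml_dataframe package.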

For a desktop application:

Just provide a proper path to your downloaded file and use a function-factory fromCsv from ml_dataframe package to read the file:

final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv');

For a flutter application:

Add the dataset to the Flutter assets by adding the following config to pubspec.yaml:

flutter:
  assets:
    - assets/datasets/pima_indians_diabetes_database.csv

You need to create the assets directory in the file system and put the dataset's file there. After that you can access the dataset:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
  final samples = DataFrame.fromRawCsv(rawCsvContent);
}
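
Alternatively, use the getPimaIndiansDiabetesDataFrame function from the ml_dataframe package, which returns the dataset as a ready-to-use DataFrame:
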
import 'package:ml_dataframe/ml_dataframe.dart';

void main() {
  final samples = getPimaIndiansDiabetesDataFrame();
}

Prepare datasets for training and testing

Data in this file is represented by 768 records and 8 features. The 9th column is a label column, it contains either 0 or 1 on each row. This column is our target - we should predict a class label for each observation. The column's name is Outcome. Let's store it:

final targetColumnName = 'Outcome';

Now it's time to prepare data splits. Since the dataset is small (only 768 records), a single train/test split would give a noisy estimate of model quality, so the better approach in our case is cross-validation. With that in mind, let's split the data in the following way using the library's splitData function:

final splits = splitData(samples, [0.7]);
final validationData = splits[0];
final testData = splits[1];

splitData accepts a DataFrame instance as the first argument and ratio list as the second one. Now we have 70% of our data as a validation set and 30% as a test set for evaluating generalization errors.
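
To make the meaning of the ratio list concrete, here is a minimal sketch in plain Dart that splits a list of records by ratios. The helper splitByRatios is hypothetical and is not the library's splitData; in particular, the rounding policy is an assumption, so the library may place the boundary one record earlier or later.

```dart
import 'dart:math';

/// Splits [items] into consecutive parts according to [ratios].
/// Hypothetical helper for illustration; not the library's splitData.
List<List<T>> splitByRatios<T>(List<T> items, List<double> ratios) {
  final parts = <List<T>>[];
  var start = 0;
  for (final ratio in ratios) {
    // Rounding to the nearest integer is an assumption made for this sketch.
    final end = min(items.length, start + (items.length * ratio).round());
    parts.add(items.sublist(start, end));
    start = end;
  }
  // Whatever is left after the listed ratios forms the last part.
  parts.add(items.sublist(start));
  return parts;
}

void main() {
  final records = List<int>.generate(768, (i) => i);
  final parts = splitByRatios(records, [0.7]);
  print(parts[0].length); // 538
  print(parts[1].length); // 230
}
```

A single ratio therefore yields two parts: the first holds roughly 70% of the records and the second holds the remainder.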

Set up a model selection algorithm

Then we may create an instance of CrossValidator class to fit the hyperparameters of our model. We should pass validation data (our validationData variable), and a number of folds into CrossValidator constructor.

final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
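
As a rough illustration of what k-fold cross-validation does with the data, the sketch below partitions record indices into k folds in plain Dart. The function kFoldIndices is hypothetical; the way CrossValidator.kFold actually assigns records to folds may differ. Each fold is used once for validation while the remaining folds are used for training.

```dart
/// Partitions indices 0..n-1 into k contiguous folds whose sizes
/// differ by at most one. Illustrative only.
List<List<int>> kFoldIndices(int n, int k) {
  final folds = <List<int>>[];
  final baseSize = n ~/ k;
  final remainder = n % k;
  var start = 0;
  for (var fold = 0; fold < k; fold++) {
    // The first `remainder` folds get one extra record.
    final size = baseSize + (fold < remainder ? 1 : 0);
    folds.add(List<int>.generate(size, (i) => start + i));
    start += size;
  }
  return folds;
}

void main() {
  // 538 is an assumed size for a 70% split of 768 records.
  const total = 538;
  final folds = kFoldIndices(total, 5);
  for (var i = 0; i < folds.length; i++) {
    final validationSize = folds[i].length;
    final trainSize = total - validationSize;
    print('fold $i: train on $trainSize, validate on $validationSize');
  }
}
```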

Let's create a factory for the classifier with desired hyperparameters. We have to decide after the cross-validation if the selected hyperparameters are good enough or not:

final createClassifier = (DataFrame samples) =>
  LogisticRegressor(
    samples,
    targetColumnName,
  );

If we want to evaluate the learning process more thoroughly, we may pass collectLearningData argument to the classifier constructor:

final createClassifier = (DataFrame samples) =>
  LogisticRegressor(
    ...,
    collectLearningData: true,
  );

This argument activates collecting costs per each optimization iteration, and you can see the cost values right after the model creation.

Evaluate the performance of the model

Assume we chose perfect hyperparameters. To validate this hypothesis, let's use the CrossValidator instance created before:

final scores = await validator.evaluate(createClassifier, MetricType.accuracy);

Since the CrossValidator instance returns a Vector of scores as a result of our predictor evaluation, we may choose any way to reduce all the collected scores to a single number, for instance, we may use Vector's mean method:

final accuracy = scores.mean();

Let's print the score:

print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

We can see something like this:

accuracy on k fold validation: 0.75

Let's assess our hyperparameters on the test set in order to evaluate the model's generalization error:

final testSplits = splitData(testData, [0.8]);
final classifier = createClassifier(testSplits[0]);
final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);

The final score is like:

print(finalScore.toStringAsFixed(2)); // approx. 0.75

If we specified collectLearningData parameter, we may see costs per each iteration in order to evaluate how our cost changed from iteration to iteration during the learning process:

print(classifier.costPerIteration);

Write the model to a json file

It seems our model generalizes well, which means we may use it in the future. To do so, we may store the model in a file as JSON:

await classifier.saveAsJson('diabetes_classifier.json');

After that we can simply read the model from the file and make predictions:

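import 'dart:io';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

// main must be async because the body uses await
void main() async {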
  // ...
  final fileName = 'diabetes_classifier.json';
  final file = File(fileName);
  final encodedModel = await file.readAsString();
  final classifier = LogisticRegressor.fromJson(encodedModel);
  final unlabelledData = await fromCsv('some_unlabelled_data.csv');
  final prediction = classifier.predict(unlabelledData);

  print(prediction.header); // ('class variable (0 or 1)')
  print(prediction.rows); // [ 
                        //   (1),
                        //   (0),
                        //   (0),
                        //   (1),
                        //   ...,
                        //   (1),
                        // ]
  // ...
}

Please note that all the hyperparameters that we used to generate the model are persisted as the model's read-only fields, and we can access them anytime:

print(classifier.iterationsLimit);
print(classifier.probabilityThreshold);
// and so on

All the code for a desktop application:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
  // Another option - to use a toy dataset:
  // final samples = getPimaIndiansDiabetesDataFrame();
  final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);
  final targetColumnName = 'Outcome';
  final splits = splitData(samples, [0.7]);
  final validationData = splits[0];
  final testData = splits[1];
  final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
  final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      samples,
      targetColumnName,
    );
  final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
  final accuracy = scores.mean();
  
  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

  final testSplits = splitData(testData, [0.8]);
  final classifier = createClassifier(testSplits[0]);
  final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);
  
  print(finalScore.toStringAsFixed(2));

  await classifier.saveAsJson('diabetes_classifier.json');
}

All the code for a flutter application:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
  // Another option - to use a toy dataset:
  // final samples = getPimaIndiansDiabetesDataFrame();
  final samples = DataFrame.fromRawCsv(rawCsvContent);
  final targetColumnName = 'Outcome';
  final splits = splitData(samples, [0.7]);
  final validationData = splits[0];
  final testData = splits[1];
  final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
  final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      samples,
      targetColumnName,
    );
  final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
  final accuracy = scores.mean();
  
  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

  final testSplits = splitData(testData, [0.8]);
  final classifier = createClassifier(testSplits[0]);
  final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);
  
  print(finalScore.toStringAsFixed(2));

  await classifier.saveAsJson('diabetes_classifier.json');
}

Linear regression

Let's try to predict house prices using linear regression and the famous Boston Housing dataset. The dataset contains 13 independent variables and 1 dependent variable - medv which is the target one (you can find the dataset in e2e/_datasets/housing.csv).

Again, first we need to download the file and create a dataframe. The dataset is headless, we may either use autoheader or provide our own header. Let's use autoheader in our example:

For a desktop application:

Just provide a proper path to your downloaded file and use a function-factory fromCsv from ml_dataframe package to read the file:

final samples = await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ');

For a flutter application:

Add the dataset to the Flutter assets by adding the following config to pubspec.yaml:

flutter:
  assets:
    - assets/datasets/housing.csv

You need to create the assets directory in the file system and put the dataset's file there. After that you can access the dataset:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_dataframe/ml_dataframe.dart';

final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
final samples = DataFrame.fromRawCsv(rawCsvContent, headerExists: false, fieldDelimiter: ' ');

Prepare the dataset for training and testing

Data in this file is represented by 505 records and 13 features. The 14th column is a target. Since we use autoheader, the target's name is autogenerated and it is col_13. Let's store it in a variable:

final targetName = 'col_13';

then let's shuffle the data:

final shuffledSamples = samples.shuffle();

Now it's time to prepare data splits. Let's split the shuffled data into train and test subsets using the library's splitData function:

final splits = splitData(shuffledSamples, [0.8]);
final trainData = splits[0];
final testData = splits[1];

splitData accepts a DataFrame instance as the first argument and ratio list as the second one. Now we have 80% of our data as a train set and 20% as a test set.

Let's train the model:

final model = LinearRegressor(trainData, targetName);

By default, LinearRegressor uses a closed-form solution to train the model. One can also use a different solution type, e.g. stochastic gradient descent algorithm:

final model = LinearRegressor.SGD(
  trainData,
  targetName,
  iterationLimit: 90,
);

or linear regression based on coordinate descent with Lasso regularization:

final model = LinearRegressor.lasso(
  trainData,
  targetName,
  iterationLimit: 90,
);

Next, we should evaluate performance of our model:

final error = model.assess(testData, MetricType.mape);

print(error);
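
For reference, MetricType.mape stands for mean absolute percentage error. The sketch below computes it in plain Dart using the textbook definition, the mean of |actual - predicted| / |actual|. The function mape and the sample numbers are made up for illustration, and whether the library reports the value as a fraction or as a percentage is an assumption not confirmed by the text above.

```dart
/// Mean absolute percentage error, returned as a fraction
/// (0.1 means 10%). Textbook definition, not the library's code.
double mape(List<double> actual, List<double> predicted) {
  if (actual.isEmpty || actual.length != predicted.length) {
    throw ArgumentError('Inputs must be non-empty and of equal length');
  }
  var sum = 0.0;
  for (var i = 0; i < actual.length; i++) {
    // The metric is undefined when an actual value is zero.
    sum += ((actual[i] - predicted[i]) / actual[i]).abs();
  }
  return sum / actual.length;
}

void main() {
  // Made-up prices and predictions, for illustration only.
  final actual = [20.0, 25.0, 40.0];
  final predicted = [22.0, 25.0, 36.0];
  // Per-record errors are 0.1, 0.0 and 0.1, so the mean is about 0.067.
  print(mape(actual, predicted).toStringAsFixed(3)); // 0.067
}
```

The lower the value, the closer the predictions are to the actual targets on average.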

If we are fine with the error, we can save the model for future use:

await model.saveAsJson('housing_model.json');

Later we may use our trained model for prediction:

import 'dart:io';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final file = File('housing_model.json');
  final encodedModel = await file.readAsString();
  final model = LinearRegressor.fromJson(encodedModel);
  final unlabelledData = await fromCsv('some_unlabelled_data.csv');
  final prediction = model.predict(unlabelledData);
    
  print(prediction.header);
  print(prediction.rows);
}

All the code for a desktop application:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final samples = (await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ')).shuffle();
  final targetName = 'col_13';
  final splits = splitData(samples, [0.8]);
  final trainData = splits[0];
  final testData = splits[1];
  final model = LinearRegressor(trainData, targetName);
  final error = model.assess(testData, MetricType.mape);
  
  print(error);

  await model.saveAsJson('housing_model.json');
}

All the code for a flutter application:

import 'package:flutter/services.dart' show rootBundle;
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
  final samples = DataFrame.fromRawCsv(rawCsvContent, headerExists: false, fieldDelimiter: ' ').shuffle();
  final targetName = 'col_13';
  final splits = splitData(samples, [0.8]);
  final trainData = splits[0];
  final testData = splits[1];
  final model = LinearRegressor(trainData, targetName);
  final error = model.assess(testData, MetricType.mape);
    
  print(error);
  
  await model.saveAsJson('housing_model.json');
}

Decision tree-based classification

Let's try to classify data from the well-known Iris dataset using a non-linear algorithm: decision trees.

First, you need to download the data and place it in a proper place in your file system. To do so, follow the instructions given in the Logistic regression section. Alternatively, you may use the getIrisDataFrame function, which returns a ready-to-use DataFrame instance filled with the Iris dataset.

After loading the data, we need to preprocess it. We should drop the Id column since it carries no predictive information. We also need to encode the 'Species' column: originally it contains 3 repeated string labels, and to feed it to the classifier we need to convert the labels into numbers:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
    final samples = getIrisDataFrame()
      .shuffle()
      .dropSeries(names: ['Id']);
    
    final pipeline = Pipeline(samples, [
      encodeAsIntegerLabels(
        featureNames: ['Species'], // Here we convert strings from 'Species' column into numbers
      ),
    ]);

    // Apply the pipeline to get a fully numeric DataFrame for the classifier
    final processed = pipeline.process(samples);
}

Next, let's create a model:

final model = DecisionTreeClassifier(
  processed,
  'Species',
  minError: 0.3,
  minSamplesCount: 5,
  maxDepth: 4,
);

As you can see, we specified 3 hyperparameters: minError, minSamplesCount and maxDepth. Let's look at the parameters in more detail:

  • minError. A minimum error on a tree node. If the error is less than or equal to the value, the node is considered a leaf.
  • minSamplesCount. A minimum number of samples on a node. If the number of samples is less than or equal to the value, the node is considered a leaf.
  • maxDepth. A maximum depth of the resulting decision tree. Once the tree reaches the maxDepth, all the level's nodes are considered leaves.

All the parameters serve as stopping criteria for the tree building algorithm.
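
The three stopping criteria can be summarised as a single predicate. The sketch below is a hypothetical helper in plain Dart, not the library's internal code; how the library measures node error and counts depth is an assumption here.

```dart
/// Returns true when a node should become a leaf according to the
/// three stopping criteria. Hypothetical helper for illustration.
bool isLeaf({
  required double error,
  required int samplesCount,
  required int depth,
  required double minError,
  required int minSamplesCount,
  required int maxDepth,
}) =>
    error <= minError ||
    samplesCount <= minSamplesCount ||
    depth >= maxDepth;

void main() {
  // Hyperparameter values taken from the model created above.
  const minError = 0.3;
  const minSamplesCount = 5;
  const maxDepth = 4;

  // Still impure, enough samples, shallow: keep splitting.
  print(isLeaf(
    error: 0.45,
    samplesCount: 40,
    depth: 1,
    minError: minError,
    minSamplesCount: minSamplesCount,
    maxDepth: maxDepth,
  )); // false

  // Error is already low enough: stop here.
  print(isLeaf(
    error: 0.2,
    samplesCount: 40,
    depth: 1,
    minError: minError,
    minSamplesCount: minSamplesCount,
    maxDepth: maxDepth,
  )); // true
}
```

Any one of the three conditions is enough to stop splitting, so tightening a single hyperparameter can make the tree noticeably smaller.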

Now we have a ready to use model. As usual, we can save the model to a JSON file:

await model.saveAsJson('path/to/json/file.json');

Unlike other models, in the case of a decision tree, we can visualise the algorithm result - we can save the model as an SVG file:

await model.saveAsSvg('path/to/svg/file.svg');

Once saved, the file can be opened in any image viewer, e.g. in a web browser.

KDTree-based data retrieval

Let's take a look at another field of machine learning - data retrieval. The field is represented by a family of algorithms, one of them is KDTree which is exposed by the library.

KDTree is an algorithm that divides the whole search space into partitions in the form of a binary tree, which makes data retrieval efficient.

Let's retrieve some data points through a kd-tree built on the Iris dataset.

First, we need to prepare the data by loading the dataset. For this purpose, we may use the getIrisDataFrame function from ml_dataframe. The function returns a DataFrame instance prefilled with the Iris data:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() {
  final originalData = getIrisDataFrame();
}

Since the dataset contains Id column that doesn't make sense and Species column that contains text data, we need to drop these columns:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() {
  final originalData = getIrisDataFrame();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
}

Next, we can build the tree:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() {
  final originalData = getIrisDataFrame();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
  final tree = KDTree(data);
}

And query nearest neighbours for an arbitrary point. Let's say, we want to find 5 nearest neighbours for the point [6.5, 3.01, 4.5, 1.5]:

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_linalg/vector.dart';

void main() {
  final originalData = getIrisDataFrame();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
  final tree = KDTree(data);
  final neighbourCount = 5;
  final point = Vector.fromList([6.5, 3.01, 4.5, 1.5]);
  final neighbours = tree.query(point, neighbourCount);
 
  print(neighbours);
}

The last instruction prints the following:

((Index: 75, Distance: 0.17349341930302867), (Index: 51, Distance: 0.21470911402365767), (Index: 65, Distance: 0.26095956499211426), (Index: 86, Distance: 0.29681616124778537), (Index: 56, Distance: 0.4172527193942372))

The nearest point has an index 75 in the original data. Let's check a record at the index:

import 'package:ml_dataframe/ml_dataframe.dart';

void main() {
  final originalData = getIrisDataFrame();
 
  print(originalData.rows.elementAt(75));
}

It prints the following:

(76, 6.6, 3.0, 4.4, 1.4, Iris-versicolor)

Remember, we dropped the Id and Species columns, which are the very first and the very last elements in the output. The remaining elements, 6.6, 3.0, 4.4, 1.4, look quite similar to our target point, 6.5, 3.01, 4.5, 1.5, so the query result makes sense.
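
The reported distance can be sanity-checked by hand. The sketch below assumes the tree measures Euclidean distance (an assumption, since the text does not name the metric) and recomputes the distance between the query point and the record at index 75 in plain Dart.

```dart
import 'dart:math';

/// Euclidean distance between two points of equal dimension.
double euclideanDistance(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Points must have the same dimension');
  }
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    final diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sqrt(sum);
}

void main() {
  final query = [6.5, 3.01, 4.5, 1.5];
  // Record at index 75 with the Id and Species values removed.
  final record = [6.6, 3.0, 4.4, 1.4];
  print(euclideanDistance(query, record).toStringAsFixed(4)); // 0.1735
}
```

The result agrees with the printed distance of about 0.1735 for index 75; the small difference in the later digits is presumably due to floating-point precision.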

If you want to use KDTree outside the ml_algo ecosystem, meaning you don't want to use ml_linalg and ml_dataframe packages in your application, you may import only KDTree library and use fromIterable constructor and queryIterable method to perform the query:

import 'package:ml_algo/kd_tree.dart';

void main() async {
  final tree = KDTree.fromIterable([
    // some data here
  ]);
  final neighbourCount = 5;
  final neighbours = tree.queryIterable([/* some point here */], neighbourCount);
 
  print(neighbours);
}

As usual, we can persist our tree by saving it to a JSON file:

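import 'dart:convert'; // provides jsonDecode
import 'dart:io';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

// main must be async because the body uses await
void main() async {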
  final originalData = getIrisDataFrame();
  final data = originalData.dropSeries(names: ['Id', 'Species']);
  final tree = KDTree(data);
 
  // ...

  await tree.saveAsJson('path/to/json/file.json');
 
  // ...

  final file = await File('path/to/json/file.json').readAsString();
  final encodedTree = jsonDecode(file) as Map<String, dynamic>;
  final restoredTree = KDTree.fromJson(encodedTree);

  print(restoredTree);
}

Models retraining

Someday our previously shining model can degrade in terms of prediction accuracy - in this case, we can retrain it. Retraining means simply re-running the same learning algorithm that was used to generate our current model keeping the same hyperparameters but using a new data set with the same features:

import 'dart:io';

final fileName = 'diabetes_classifier.json';
final file = File(fileName);
final encodedModel = await file.readAsString();
final classifier = LogisticRegressor.fromJson(encodedModel);

// ... 
// here we do something and realize that our classifier performance is not so good
// ...

final newData = await fromCsv('path/to/dataset/with/new/data/to/retrain/the/classifier');
final retrainedClassifier = classifier.retrain(newData);

The workflow with other predictors (SoftmaxRegressor, DecisionTreeClassifier and so on) is quite similar to the one described above for LogisticRegressor. Feel free to experiment with other models.

A couple of words about linear models which use gradient optimisation methods

Sometimes you may get NaN or Infinity as the value of your score, or the score may be implausibly large or small. To prevent this, you need to find a proper value for the initial learning rate; you may also choose between the following learning rate strategies: constant, timeBased, stepBased and exponential:

final createClassifier = (DataFrame samples) =>
    LogisticRegressor(
      ...,
      initialLearningRate: 1e-5,
      learningRateType: LearningRateType.timeBased,
      ...,
    );
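
To give an intuition for how such strategies behave, the sketch below implements common textbook forms of the four schedules in plain Dart. The exact formulas and decay constants used by the library are not stated here, so the function names, the parameters decay, drop and stepSize, and their default values are all assumptions made for illustration.

```dart
import 'dart:math';

/// The rate stays the same on every iteration.
double constantRate(double initial, int iteration) => initial;

/// The rate shrinks proportionally to 1 / (1 + decay * iteration).
double timeBasedRate(double initial, int iteration, {double decay = 1.0}) =>
    initial / (1 + decay * iteration);

/// The rate is multiplied by [drop] once every [stepSize] iterations.
double stepBasedRate(double initial, int iteration,
        {double drop = 0.5, int stepSize = 10}) =>
    initial * pow(drop, iteration ~/ stepSize);

/// The rate decays exponentially with the iteration number.
double exponentialRate(double initial, int iteration, {double decay = 0.1}) =>
    initial * exp(-decay * iteration);

void main() {
  const initial = 1e-5;
  for (final iteration in [0, 10, 50]) {
    print('iteration $iteration: '
        'constant=${constantRate(initial, iteration)}, '
        'timeBased=${timeBasedRate(initial, iteration)}, '
        'stepBased=${stepBasedRate(initial, iteration)}, '
        'exponential=${exponentialRate(initial, iteration)}');
  }
}
```

All decaying schedules start at the initial rate and take smaller steps as training progresses, which is what helps keep the weights from diverging to NaN or Infinity.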

Helpful articles on algorithms standing behind the library

Contacts

If you have questions, feel free to text me on

ml_algo's Issues

Persistence for LogisticRegressor

Hello,
At the moment I have to retrain the model each time I use it. Fortunately, this does not take a lot of time, but because of that all sample data has to be present anywhere the model is used. One workaround that I found is to manually initialize a LinearRegressorImpl object with coefficients obtained from a trained LogisticRegressor model. However, this requires importing package private files which is not ideal. Adding a way to persist regressor models would be a great improvement!

Best regards.

Random forest implementation

Hello !
I have been using flutter on my free time and found out about this interesting library.
As a side project, I would like to add an implementation of the random forest to the library via a pull request.
Would you be interested ?

Exception: The dimension of the vector and the columns number of the matrix mismatch

Hello, I have a chart with many data points in my Flutter app and I am trying to draw a trend line in it as you can see in this example: https://google.github.io/charts/flutter/example/combo_charts/scatter_plot_line
To do so, I have programmed the class seen below and implemented the LinearRegressor in it. I want to display the line by determining the y-values for two values on the x-axis of my chart using my predict function and then drawing a line through those points. However, it seems that there is a bug in my predict function which you can see below and I can't quite figure it out.
I can say, that in the train function I have two columns, in the first column are the x values and in the second column are the corresponding y values of the points in the chart. I then use this data to train the LinearRegressor. The predict function takes an x-value, which is why the ySeries, the second column, is empty there. With one row and two columns, the second of which is empty, I then try to predict this empty space, which is the y-value.
I assume that something is wrong here in my implementation of the prediction, but unfortunately I don't understand what. I hope the error description is sufficient.

This is the error I get:

E/flutter ( 8937): [ERROR:flutter/lib/ui/ui_dart_state.cc(199)] Unhandled Exception: Exception: The dimension of the vector and the columns number of the matrix mismatch
E/flutter ( 8937): #0      MatrixImpl._matrixVectorMul (package:ml_linalg/src/matrix/matrix_impl.dart:500:7)
E/flutter ( 8937): #1      MatrixImpl.* (package:ml_linalg/src/matrix/matrix_impl.dart:89:14)
E/flutter ( 8937): #2      LinearRegressorImpl.predict (package:ml_algo/src/regressor/linear_regressor/linear_regressor_impl.dart:152:11)
E/flutter ( 8937): #3      AnalyticsLinearRegression.predict (package:trimlog/services/ml/analytics_linear_regression.dart:87:35)
E/flutter ( 8937): <asynchronous suspension>
E/flutter ( 8937): #4      _AnalyticGraphState.build.<anonymous closure>.<anonymous closure>.<anonymous closure>._predict.<anonymous closure> (package:trimlog/screens/analytics/analytics_graphs.dart:143:49)
E/flutter ( 8937): <asynchronous suspension>
E/flutter ( 8937):

This is the class used for the linear regression:

class AnalyticsLinearRegression {
  List<Trim> trims;
  final String xCategory;
  final String xParameter;
  final String yCategory;
  final String yParameter;

  AnalyticsLinearRegression(this.trims, this.xCategory, this.xParameter, this.yCategory, this.yParameter);

  List<Series> _prepareData() {
    // Remove trims which do not contain the parameter shown in this analytic
    List<Trim> temp = new List.from(trims);
    trims.forEach((trim) {
      Map<String, dynamic> map = trim.toMap();
      if ((!(map[xCategory] as Map).containsKey(xParameter)) || (map[xCategory][xParameter] == null) || (!(map[yCategory] as Map).containsKey(yParameter)) || (map[yCategory][yParameter] == null))
        temp.remove(trim);
    });
    trims = temp;
    // Extract the parameters shown in the analytic from the trims
    List x = [];
    List y = [];
    trims.forEach((trim) {
      Map<String, dynamic> map = trim.toMap();
      x.add((map[xCategory][xParameter] is List ? map[xCategory][xParameter].first : map[xCategory][xParameter]) * 1.0);
      y.add((map[yCategory][yParameter] is List ? map[yCategory][yParameter].first : map[yCategory][yParameter]) * 1.0);
    });
    Series xSeries = Series(xParameter, x); // First column, given parameter
    Series ySeries = Series(yParameter, y); // Second column, predicted parameter
    return [xSeries, ySeries];
  }

  Future train() async {
    final Iterable<Series> data = _prepareData();
    final dataFrame = DataFrame.fromSeries(data);
    if (dataFrame.rows.length <= 2) return; // <= 2 datapoints results in errors
    final targetColumnName = yParameter; // The second column (y) contains the parameter that I later want to predict
    final splits = splitData(dataFrame, [0.7]);
    final validationData = splits[0];
    // final testData = splits[1];
    final validator = CrossValidator.kFold(validationData, numberOfFolds: validationData.rows.length - 1);
    final createClassifier = (DataFrame samples) => LinearRegressor(
          samples,
          targetColumnName,
          optimizerType: LinearOptimizerType.gradient,
          iterationsLimit: 90,
          learningRateType: LearningRateType.decreasingAdaptive,
          batchSize: samples.rows.length,
        );
    final scores = await validator.evaluate(createClassifier, MetricType.rmse);
    final accuracy = scores.mean();
    print('Accuracy on root mean squared error (RMSE) validation: ${accuracy.toStringAsFixed(2)}');
    // final testSplits = splitData(testData, [1.00]);
    // final classifier = createClassifier(testSplits[0]);
    // final finalScore = classifier.assess(testSplits[1], MetricType.rmse);
    // print(finalScore.toStringAsFixed(2));
    // await classifier.saveAsJson(xParameter + '_' + yParameter + '_classifier.json');
    final classifier = createClassifier(dataFrame);
    await classifier.saveAsJson(await _classifierPath);
  }

  Future retrain(List<Trim> newData) async {
    final classifier = await _linearRegressor;
    trims = newData;
    final Iterable<Series> data = _prepareData();
    final dataFrame = DataFrame.fromSeries(data);
    final retrainedClassifier = classifier.retrain(dataFrame);
    await retrainedClassifier.saveAsJson(await _classifierPath);
  }

  /// Predicts the y value (second column) for a given x value (double).
  /// Can be used to get two points and draw a line through them as a trendline (like here: https://google.github.io/charts/flutter/example/combo_charts/scatter_plot_line)
  Future<double> predict(double x) async {
    final classifier = await _linearRegressor;
    Series xSeries = Series(xParameter, [x]); // First column value (x)
    Series ySeries = Series(yParameter, []); // Second column value (y) should get predicted and returned, therefore this is empty
    final data = DataFrame.fromSeries([xSeries, ySeries]);
    final prediction = classifier.predict(data); // Predict the corresponding y value to the given x value
    return prediction.rows.first.first; // Prediction should only contain one row and this row should contain the predicted y value
  }

  Future<String> get _classifierPath async => (await getTemporaryDirectory()).path + "/" + xParameter + '_' + yParameter + '_classifier.json'; // Path where the classifier is saved
  Future<File> get _file async => File(await _classifierPath); // File containing the classifier (JSON)
  Future<String> get _encodedModel async => (await _file).readAsString(); // Classifier as JSON
  Future<LinearRegressor> get _linearRegressor async => LinearRegressor.fromJson(await _encodedModel); // Linear regressor from file
}

LinearRegressor with OrdinaryLeastSquares

I see we can only choose between Gradient and Coordinate. I believe this is why the data I'm getting back isn't what I expect. If that's not the reason, please let me know, thanks!

For example, if I have the data

Grind, RoastLevel, Time
5.5, 5, 20
5.25, 5, 22

And I make RoastLevel and Time independent variables, and Grind dependent, I can get an expected prediction with Python using Sklearn out of the box.

data_minus_grind = []
grind_data = []
for row in floats:
    data_minus_grind.append(row[1:3])
    grind_data.append(row[0])

model = LinearRegression().fit(data_minus_grind, grind_data)
prediction_data = [[5,25]]
prediction = model.predict(prediction_data)

With this code, predicting the Grind for a Time of 25 gives 4.875, which sounds about right: Grind should go down as Time goes up.

However, if I try to use this library, my slope is always in the wrong direction, with the two variables moving in the same direction for some reason.

If I try

final samples = DataFrame.fromRawCsv(rawCsvContent, headerExists: true);
const targetColName = "Grind";
final defaultRegressor = LinearRegressor(
    samples,
    targetColName
);
final dataToPredict = [
    // Roast level, time
    [ 5, 25.0 ]
];
final dataframeToPredict = DataFrame(dataToPredict, headerExists: false);
final prediction = defaultRegressor.predict(dataframeToPredict);

Then the result for Grind with a Time of 25.0 is 6.79. As I move up the time, Grind should decrease, but instead it increases. I've tried tweaking many of the parameters but haven't found a fix.

Thanks!
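
As a sanity check on the expected number, here is a minimal closed-form least-squares sketch in plain Dart. It does not use ml_algo at all; `fitLine` is a hypothetical helper written for this illustration. Because RoastLevel is constant (5) in both rows, it carries no information about the slope, so the sketch assumes it can be dropped and fits Grind against Time only.

```dart
// Closed-form least squares for one feature: y = k * x + b.
// Returns [k, b]. Illustrative helper, not part of ml_algo.
List<double> fitLine(List<double> xs, List<double> ys) {
  final n = xs.length;
  final meanX = xs.reduce((a, b) => a + b) / n;
  final meanY = ys.reduce((a, b) => a + b) / n;
  var sxy = 0.0;
  var sxx = 0.0;
  for (var i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) * (xs[i] - meanX);
  }
  final k = sxy / sxx;
  return [k, meanY - k * meanX];
}

void main() {
  // Time -> Grind, taken from the two rows above.
  // RoastLevel is constant, so it is left out of this one-feature check.
  final line = fitLine([20.0, 22.0], [5.5, 5.25]);
  final k = line[0];
  final b = line[1];
  print('k=$k b=$b prediction at 25: ${k * 25 + b}'); // prediction is 4.875
}
```

The slope comes out negative (-0.125), which agrees with the sklearn result quoted above. A gradient-based optimizer run for a limited number of iterations on unscaled data may stop well short of this solution, which could explain a slope with the wrong sign.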

Examples of configuration for LinearRegressor?

Hey,

Thanks a lot for the library. Really impressed with how much you can do with dart!

I'm trying to run a linear regression for a simple line y(x) = x and found the following issues, which I suppose are due to the configuration of the regressor. Please help me configure it.

The code below gives the expected result in most cases, with k around 1.00. However, in some cases, e.g.

a=1 n=10 -> k (0.9994153380393982) rows ((9.994153022766113))
a=0 n=10 -> k (0.3038938045501709) rows ((3.038938045501709))
a=-10 n=10 -> k (0.5980027318000793) rows ((5.980027198791504))
a=1 n=100 -> k (NaN) rows ((0.0))  

the result is different. Is this because of the configuration?

Also is there a way to retrieve b from y(x) = kx + b?

Thank you!

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:xrange/xrange.dart';

main() {
 var a = 1;
 var n = 100;

 var _data = NumRange.closed(a, n).values().map((it) => [it, it]) ;

 final data = [['x', 'y'], ..._data];

 print(data);

 final samples = DataFrame(data, headerExists: true);
 final regressor = LinearRegressor(samples, 'y');

 var prediction = regressor.predict(DataFrame([['x', 'y'], [10.0,]],));

 print("a=$a n=$n -> k ${regressor.coefficients} rows ${prediction.rows}");
}
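
One possible contributor to the NaN for n=100 is a gradient step that is too large for unscaled x values. The sketch below is plain Dart, not ml_algo code: `fitSlope` is a hypothetical helper, and the learning rate (0.001) and iteration count (90) are assumptions chosen for illustration, not the library's defaults. It runs batch gradient descent for y = k * x on the same kind of data.

```dart
// Batch gradient descent for y = k * x on the points x = y = a..n.
// learningRate and iterations are illustrative assumptions,
// not values taken from ml_algo.
double fitSlope(int a, int n,
    {double learningRate = 0.001, int iterations = 90}) {
  final xs = [for (var i = a; i <= n; i++) i.toDouble()];
  var k = 0.0;
  for (var it = 0; it < iterations; it++) {
    var gradient = 0.0;
    for (final x in xs) {
      // Derivative of (k * x - y)^2 with respect to k, where y = x.
      gradient += 2 * x * (k * x - x);
    }
    k -= learningRate * gradient / xs.length;
  }
  return k;
}

void main() {
  print(fitSlope(1, 10)); // settles close to 1
  print(fitSlope(1, 100)); // same step size, larger x values: k blows up
}
```

With x in 1..10 the update shrinks the error on every step, while with x in 1..100 the same step size overshoots and the error grows each iteration. If this is what happens inside the regressor, scaling the features or lowering the learning rate would be the things to try first.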

'diabetes_classifier.json' (OS Error: Read-only file system, errno = 30)

Hi, I have been trying to replicate the example here, but I cannot write the json classifier model due to the following error:

Running "flutter pub get" in logistic_regressor...
Launching lib\main.dart on sdk gphone x86 in debug mode...
Running Gradle task 'assembleDebug'...
√  Built build\app\outputs\flutter-apk\app-debug.apk.
Installing build\app\outputs\flutter-apk\app.apk...
Debug service listening on ws://127.0.0.1:64760/6bAEB-pabM4=/ws
Syncing files to device sdk gphone x86...
I/flutter ( 7530): accuracy on k fold validation: 0.63
I/flutter ( 7530): 0.76
E/flutter ( 7530): [ERROR:flutter/lib/ui/ui_dart_state.cc(199)] Unhandled Exception: FileSystemException: Cannot create file, path = 'diabetes_classifier.json' (OS Error: Read-only file system, errno = 30)
E/flutter ( 7530): #0      _File.create.<anonymous closure> (dart:io/file_impl.dart:255:9)
E/flutter ( 7530): #1      _rootRunUnary (dart:async/zone.dart:1362:47)
E/flutter ( 7530): #2      _CustomZone.runUnary (dart:async/zone.dart:1265:19)
E/flutter ( 7530): <asynchronous suspension>
E/flutter ( 7530): #3      SerializableMixin.saveAsJson (package:ml_algo/src/common/serializable/serializable_mixin.dart:9:18)
E/flutter ( 7530): <asynchronous suspension>
E/flutter ( 7530): #4      _MyHomePageState.trainModel (package:logistic_regressor/main.dart:99:5)
E/flutter ( 7530): <asynchronous suspension>
E/flutter ( 7530): 

I have tried declaring the permission in android/app/src/main/AndroidManifest.xml

<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.logistic_regressor">
    <uses-permission android:name="android.permission.MANAGE_EXTERNAL_STORAGE" />

and also requested the permission via Permission.manageExternalStorage.request() in my main.dart:

import 'dart:io';
import 'package:flutter/material.dart';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:flutter/services.dart' show rootBundle;

import 'package:permission_handler/permission_handler.dart';

void main() {
  runApp(MyApp());
}

class MyApp extends StatelessWidget {
  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      title: 'Flutter Demo',
      theme: ThemeData(
        primarySwatch: Colors.blue,
      ),
      home: MyHomePage(title: 'Flutter Demo Home Page'),
    );
  }
}

class MyHomePage extends StatefulWidget {
  MyHomePage({Key? key, required this.title}) : super(key: key);
  final String title;

  @override
  _MyHomePageState createState() => _MyHomePageState();
}

class _MyHomePageState extends State<MyHomePage> {
  void trainModel() async {
    final rawCsvContent = await rootBundle.loadString('datasets/pima_indians_diabetes_database.csv');
    final samples = DataFrame.fromRawCsv(rawCsvContent);

    // === Prepare Dataset ===
    final targetColumnName = 'class variable (0 or 1)';

    final splits = splitData(samples, [0.7]);
    final validationData = splits[0];
    final testData = splits[1];

    // === Setup model selection algorithm ===
    final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);

    final createClassifier = (DataFrame samples) =>
        LogisticRegressor(
          samples,
          targetColumnName,
          optimizerType: LinearOptimizerType.gradient,
          iterationsLimit: 90,
          learningRateType: LearningRateType.decreasingAdaptive,
          batchSize: samples.rows.length,
          probabilityThreshold: 0.7,
          collectLearningData: true,
        );

    // === Evaluate model performance ===
    final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
    final accuracy = scores.mean();
    print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

    final testSplits = splitData(testData, [0.8]);
    final classifier = createClassifier(testSplits[0]);
    final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);

    print(finalScore.toStringAsFixed(2)); // approx. 0.75

    // === Write the model to JSON file ===
    var status = await Permission.manageExternalStorage.status;
    if (status.isDenied) {
      await Permission.manageExternalStorage.request();
    }

    await classifier.saveAsJson('diabetes_classifier.json');
  }

  @override
  void initState() {
    // TODO: implement initState
    super.initState();
    trainModel();
  }

  @override
  Widget build(BuildContext context) {
    return new MaterialApp(
      home: new Scaffold(
        appBar: new AppBar(
          title: new Text('Plugin example app'),
        ),
        body: new Center(
          child: new Column(children: <Widget>[
            new Text('Running'),
          ]),
        ),
      ),
    );
  }
}

If anyone has insight into what the problem is, I would really appreciate your help.
Cheers!
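
A likely cause is that 'diabetes_classifier.json' is a relative path, so it resolves against the process working directory, which is typically not writable on Android regardless of storage permissions. The sketch below is plain dart:io and only shows building an absolute path inside a writable directory; `modelPathIn` is a hypothetical helper, and the JSON written is a placeholder rather than a real model.

```dart
import 'dart:io';

// Joins a writable base directory and a file name into an absolute path.
// Hypothetical helper for illustration.
String modelPathIn(Directory baseDir, String fileName) =>
    '${baseDir.path}${Platform.pathSeparator}$fileName';

Future<void> main() async {
  // A fresh directory under the system temp location stands in for the
  // app's writable directory.
  final baseDir = await Directory.systemTemp.createTemp('ml_algo_demo');
  final path = modelPathIn(baseDir, 'diabetes_classifier.json');

  // Placeholder content; in the app this would be the saved classifier.
  await File(path).writeAsString('{"placeholder": true}');
  print('written: ${File(path).existsSync()}');

  await baseDir.delete(recursive: true);
}
```

In a Flutter app the base directory would more usually come from the path_provider package, for example getTemporaryDirectory() or getApplicationDocumentsDirectory(), and the resulting path would be passed to classifier.saveAsJson(path). Those directories belong to the app, so no external storage permission should be needed for them.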

support vector machine

Hi! Thanks for the great plugin!
Are you planning to make a classifier with support vector machine?
If that is the case, when is it going to be released?
Thanks.

Path down decision tree

Is there a way to get the path down the tree for a prediction from the decision tree? I can import the implementation, copy/paste the predict method and add nodes to a list, but that's not ideal.

Web support?

Why doesn't this lib work on the web platform? Any plans to implement web support?

Reinforcement Learning

How difficult would it be to implement RL algorithms / provide support for RL-based training methods? Is this on the roadmap by any chance?

Compatibility problems with other packages

I'm facing a problem with ml_algo. The package requires json_annotation > 3.0.1, but I also use build_runner, which requires > 4.0.1. This stops my development because pub get fails with this incompatibility. If it isn't a problem, please update json_annotation to 4.0.1 in all the packages.

How to persist DTree?

Hi, when I create a DecisionTreeClassification, how can I persist its model so I can reuse it in later sessions? Training a DTree takes a lot of time.

Blank invalid exception while creating classifier

Here is my data :

(src, day, time, dest)
(0, 0, 450, 4)
(1, 0, 110, 5)
(0, 1, 450, 4)
(1, 1, 110, 5)
(0, 2, 450, 4)
(1, 2, 110, 5)
(0, 3, 450, 4)
(1, 3, 110, 5)
(0, 4, 450, 4)
(1, 4, 110, 5)
(0, 5, 450, 4)
(1, 5, 110, 5)
(2, 6, 660, 6)
(3, 6, 1170, 7)
(0, 0, 450, 4)
(1, 0, 110, 5)
(0, 1, 450, 4)
(1, 1, 110, 5)
(0, 2, 450, 4)
(1, 2, 110, 5)
(0, 3, 450, 4)
(1, 3, 110, 5)
(0, 4, 450, 4)
(1, 4, 110, 5)
(0, 5, 450, 4)
(1, 5, 110, 5)
(2, 6, 660, 6)
(3, 6, 1170, 8)

It then throws this exception while trying to create the classifier:

Unhandled exception:
Invalid argument(s)
#0      _TypedList._setFloat32 (dart:typed_data-patch/typed_data_patch.dart:2126:36)
#1      _Float32ArrayView.[]= (dart:typed_data-patch/typed_data_patch.dart:4461:16)
#2      new Float32MatrixDataManager.fromList
package:ml_linalg//data_manager/float32_matrix_data_manager.dart:37

#3      MatrixFactoryImpl.fromList
package:ml_linalg//matrix/matrix_factory_impl.dart:21
#4      new Matrix.fromList
package:ml_linalg/matrix.dart:42
#5      DataFrameImpl.toMatrix
package:ml_dataframe//data_frame/data_frame_impl.dart:143
#6      createLogLikelihoodOptimizer
package:ml_algo//_helpers/create_log_likelihood_optimizer.dart:46

#7      LogisticRegressorFactoryImpl.create
package:ml_algo//logistic_regressor/logistic_regressor_factory_impl.dart:58
#8      new LogisticRegressor
package:ml_algo//logistic_regressor/logistic_regressor.dart:153
#9      main.<anonymous closure>
bin\knn.dart:41
#10     main
bin\knn.dart:53
<asynchronous suspension>

The classifier is constructed this way:

 final createClassifier = (DataFrame samples) => LogisticRegressor(
        samples,
        targetColumnName,
        optimizerType: LinearOptimizerType.gradient,
        iterationsLimit: 90,
        learningRateType: LearningRateType.decreasingAdaptive,
        batchSize: samples.rows.length,
        probabilityThreshold: 0.7,
      );
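
The stack trace ends in `_setFloat32`, which suggests that some cell reaching the Float32 matrix was not a double (for example an int, a String, or null). This is an assumption based on the trace alone, not a confirmed diagnosis. A minimal sketch of normalising the rows before building the DataFrame, in plain Dart with a hypothetical helper `toDoubleRows`:

```dart
// Converts every cell to double and fails with a readable message on
// anything non-numeric. Hypothetical helper, not part of ml_dataframe.
List<List<double>> toDoubleRows(List<List<Object?>> rows) {
  return [
    for (var r = 0; r < rows.length; r++)
      [
        for (var c = 0; c < rows[r].length; c++)
          _toDouble(rows[r][c], r, c),
      ],
  ];
}

double _toDouble(Object? value, int row, int column) {
  if (value is num) return value.toDouble();
  throw FormatException(
      'Non-numeric cell at row $row, column $column: $value');
}

void main() {
  // First two data rows from the issue: (src, day, time, dest).
  final rows = toDoubleRows([
    [0, 0, 450, 4],
    [1, 0, 110, 5],
  ]);
  print(rows); // every value is now a double
}
```

The converted rows, with the header row prepended, could then be passed to the DataFrame constructor. If the conversion throws, the message points at the offending cell instead of a blank "Invalid argument(s)". Note also that the dest column holds more than two distinct values, while the README describes LogisticRegressor as a binary classifier.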

Data visualisation [enhancement]

  1. Support for real-time visualizations: The library could support real-time visualizations to enable users to see the results of their algorithms in real time as they are being trained.
  2. Interactive visualizations, such as dynamic scatter plots, 3D plots, and interactive decision trees could be implemented in the library to allow users to explore and understand their data better.

KDTree query() function returning wrong nearest neighbor

I want to return the nearest neighbor to my data point. However, it is returning the wrong one. I tried returning multiple nearest neighbors and there also seem to be some inconsistencies with the returned list of neighbors.

In the screenshot below, if I return the nearest neighbor, it has distance 8,9. But if I return the two nearest neighbors then it actually returns to me the two nearest ones with distances 2,0 and 3,19.

[Screenshot 2024-03-18 at 09 49 11]

The same happens when returning the 3 nearest and the 4 nearest neighbors: for k=3, the values are wrong.
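
One way to pin down which results are wrong is to compare the tree's output against a brute-force scan over the same points. The sketch below is plain Dart and independent of ml_algo; `euclidean` and `bruteForceNearest` are hypothetical helpers, and Euclidean distance is an assumption about the metric used in the query.

```dart
import 'dart:math';

// Euclidean distance between two points of equal dimension.
double euclidean(List<double> a, List<double> b) {
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    final d = a[i] - b[i];
    sum += d * d;
  }
  return sqrt(sum);
}

// Returns the indices of the k points closest to query, nearest first.
// Sorts all points by distance, so it is slow but easy to trust.
List<int> bruteForceNearest(
    List<List<double>> points, List<double> query, int k) {
  final indices = List<int>.generate(points.length, (i) => i);
  indices.sort((i, j) =>
      euclidean(points[i], query).compareTo(euclidean(points[j], query)));
  return indices.take(k).toList();
}

void main() {
  final points = [
    [0.0, 0.0],
    [3.0, 4.0],
    [1.0, 1.0],
  ];
  print(bruteForceNearest(points, [0.9, 0.9], 2)); // [2, 0]
}
```

Running this for k = 1 through 4 on the real data and comparing it with the tree's answers would show exactly which k values disagree.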

Integrate python algorithms like xgboost using ffi

This is a great project. I started to hate Python after using Dart with Flutter. I realized that I am still googling every basic thing when using Python, whereas with Dart it just takes microseconds to find the right method after typing a dot. I wonder if it is possible to integrate popular machine learning algorithms like xgboost in Dart using FFI.

Thanks a lot for creating this library.

Data persistence for Knn model

Hi, what would be the best way to persist a knn model without recreating it every time? Is this also serializable, or does this model always need data to compute predictions lazily?
