

HeftyCoder commented on June 18, 2024

Making abstractions for the models shouldn't be that hard. The major boilerplate I'm seeing is in the hyperparameter tuner and cross-validation. I'll try to think of something and update appropriately when I can!


rcurtin commented on June 18, 2024

Hi there @HeftyCoder,

The biggest place that virtual function calls would make a noticeable difference is in inner loops; one example is the distance metric used in k-nearest-neighbor search. Another example would be a very cheap member function. To make up a slightly contrived situation, imagine mlpack had a base learner class that implemented the function GetNumberOfTrainingPoints() (and also imagine that we cached that at training time for some reason...). If a user writes a complicated function that calls GetNumberOfTrainingPoints() in some deep inner loop, then the vtable lookup penalty can be noticed there. For this and similar reasons, mlpack has generally tried to avoid dynamic polymorphism.
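To make that concrete, here is a minimal sketch (all names invented for illustration, not mlpack classes) contrasting the two dispatch styles in an inner loop. The virtual version pays a vtable lookup on every iteration; in the template version the call can be inlined away entirely:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Dynamic polymorphism: every call in the loop goes through the vtable.
struct MetricBase
{
  virtual ~MetricBase() = default;
  virtual double Evaluate(double a, double b) const = 0;
};

struct VirtualMetric : MetricBase
{
  double Evaluate(double a, double b) const override { return std::abs(a - b); }
};

double NearestVirtual(const std::vector<double>& points, double query,
                      const MetricBase& metric)
{
  double best = 1e30;
  for (double p : points)              // one vtable lookup per point
    best = std::min(best, metric.Evaluate(query, p));
  return best;
}

// Static polymorphism (mlpack's style): the metric is a template
// parameter, so the call can be inlined into the inner loop.
struct StaticMetric
{
  static double Evaluate(double a, double b) { return std::abs(a - b); }
};

template<typename MetricType>
double NearestStatic(const std::vector<double>& points, double query)
{
  double best = 1e30;
  for (double p : points)              // no vtable; call is inlinable
    best = std::min(best, MetricType::Evaluate(query, p));
  return best;
}
```

Both compute the same result; the difference only shows up in the generated code for the loop body.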

(Of course, it is true that for larger functions, the vtable lookup penalty is generally unnoticeable. For instance, if training the model takes even 10ms or some similarly small amount of time, a single vtable lookup at the start of that function has negligible cost. But that is a little beside the point of the reasoning above. I don't think there would be huge opposition to a change that also addressed the situations above, but you are right that it would be a reasonably large refactoring.)

Anyway, to answer your original question more directly, I think that there are not many ways around the boilerplate here. It would be relatively easy to write a single wrapper for a machine learning method, something like this:

class MyLogisticRegression : public BaseModel
{
 public:
  MyLogisticRegression( stuff ... ) : model( stuff ... ) { }

  virtual void Train( stuff ... ) { model.Train( stuff ... ); }

  LogisticRegression model;
};

and it may even be possible to use templates...

template<typename ModelType>
class MyModel : public BaseModel
{
 public:
  MyModel( stuff ... ) : model( stuff ... ) { }

  virtual void Train( stuff ... ) { model.Train( stuff ... ); }

  ModelType model;
};

The trick would just be settling on standardized parameters for stuff ..., but at least mlpack's Train() and Predict() APIs are consistent across learners (at least for the data and responses).
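To make the templated suggestion concrete, here is a self-contained sketch with the `stuff ...` filled in by hypothetical standardized parameters, and a toy stand-in learner instead of a real mlpack class (every name here is illustrative, not mlpack's API):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical standardized parameters standing in for "stuff ...".
using Data = std::vector<std::vector<double>>;
using Labels = std::vector<size_t>;

class BaseModel
{
 public:
  virtual ~BaseModel() = default;
  virtual void Train(const Data& x, const Labels& y) = 0;
  virtual size_t Predict(const std::vector<double>& point) const = 0;
};

// Toy stand-in for a learner with a consistent Train()/Classify() interface:
// it just predicts the majority class seen at training time.
class MajorityClassifier
{
 public:
  void Train(const Data& /* x */, const Labels& y)
  {
    size_t ones = 0;
    for (size_t label : y)
      ones += (label == 1);
    majority = (2 * ones >= y.size()) ? 1 : 0;
  }

  size_t Classify(const std::vector<double>& /* point */) const
  {
    return majority;
  }

 private:
  size_t majority = 0;
};

// The templated wrapper: one short definition covers every learner whose
// interface matches.
template<typename ModelType>
class MyModel : public BaseModel
{
 public:
  void Train(const Data& x, const Labels& y) override { model.Train(x, y); }
  size_t Predict(const std::vector<double>& p) const override
  {
    return model.Classify(p);
  }

 private:
  ModelType model;
};
```

With this in place, a `std::unique_ptr<BaseModel>` can hold a `MyModel<MajorityClassifier>` (or, with real mlpack classes, something like a `MyModel<LogisticRegression>`), and the choice of learner becomes a runtime decision.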

I hope this helps; let me know if I can clarify anything. 👍


HeftyCoder commented on June 18, 2024

Thanks for the swift reply @rcurtin. I had read the reasoning behind using static polymorphism, so I had no problem understanding its usage. What I don't understand about that decision is that everything is completely static. Just as you mentioned, there should be functions where a vtable lookup would not matter at all. Template type constraints (C++20 concepts) might actually be useful for solving these issues, but since they are a C++20 feature, I believe they won't be compatible with this library for some time.

To further clarify, I also see some decisions on separation of concerns that I can't wrap my head around. For example, all the performance metrics for classification problems call Classify inside their Evaluate method. That means there is no out-of-the-box way to get multiple metrics without classifying again (unless I'm missing something). Is the reason to accommodate cross-validation for large k? Or is it the general philosophy of not using dynamic polymorphism?
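As a sketch of what I mean (plain C++, hypothetical names, no mlpack types): classify once, store the predictions, and then derive several metrics from them without touching the model again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Several metrics computed from one stored prediction vector, so the
// model's Classify() only has to run once.
struct BinaryMetrics
{
  double accuracy;
  double precision;
  double recall;
};

BinaryMetrics EvaluateAll(const std::vector<size_t>& predicted,
                          const std::vector<size_t>& actual)
{
  size_t tp = 0, fp = 0, fn = 0, correct = 0;
  for (size_t i = 0; i < predicted.size(); ++i)
  {
    correct += (predicted[i] == actual[i]);
    tp += (predicted[i] == 1 && actual[i] == 1);
    fp += (predicted[i] == 1 && actual[i] == 0);
    fn += (predicted[i] == 0 && actual[i] == 1);
  }

  return { double(correct) / predicted.size(),
           (tp + fp) ? double(tp) / (tp + fp) : 0.0,
           (tp + fn) ? double(tp) / (tp + fn) : 0.0 };
}
```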

All in all, I should probably clarify what I plan to do. I intend to build UIs for training, validating, and perhaps tuning various models. For designing the UIs for specific model parameters, dynamic polymorphism wouldn't solve the issue. I need to map specific metadata for each classifier so I can build my UIs and connect them to mlpack. Essentially, I need to be able to pass the metadata to a method and receive common interfaces for tuning, cross-validation, and the model. Currently I am searching for a metadata repository, like OpenML.

Do you think it would be possible to achieve what I want with mlpack without too much boilerplate? I would also like to be able to print various performance metrics without having to re-classify the same data.

EDIT: Sorry for closing this, it was an unfortunate misclick!
EDIT2: Do you think a converter to ONNX format (if possible) would be a good idea?


rcurtin commented on June 18, 2024

You are right that the performance metrics we have implemented don't have a way to compute multiple metrics at a time. Maybe that's something we could improve on in the future. The CV/HPT infrastructure was only made to optimize one metric, which is probably the reason it was written that way. (Certainly we'd be open to improvements if you wanted to make some!) That design choice seems orthogonal to the issue of static/dynamic polymorphism.

To answer your question about boilerplate, fundamentally, mlpack's preference for templates to configure an algorithm means that selecting between different algorithms at runtime (as opposed to at compile time) necessitates some amount of boilerplate. To use an example, consider the k-nearest-neighbor search class, NeighborSearch; it has lots of template parameters to configure the behavior. But we provide a command-line program that allows you to select---at runtime---the type of tree being used and other parameters that are template parameters. This means we have to write a "shim" of some sort that moves those template parameters from compile-time into runtime, which unfortunately can be a little tedious. Here is how we do that:

https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/neighbor_search/ns_model.hpp

In essence, in this situation we use dynamic polymorphism to provide a base class (NSWrapperBase) that provides a handful of common interface functions, and then we inherit from that class for different template types being used with NeighborSearch.
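The pattern in ns_model.hpp can be boiled down to something like this sketch (names here are invented for illustration; the real file is considerably more involved): a factory function inspects a runtime parameter and constructs the matching template instantiation behind the common base class.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Common runtime-facing interface (the role NSWrapperBase plays).
class SearchBase
{
 public:
  virtual ~SearchBase() = default;
  virtual std::string TreeName() const = 0;
};

// Stand-ins for tree types that would normally be template parameters.
struct KDTree   { static constexpr const char* name = "kd"; };
struct BallTree { static constexpr const char* name = "ball"; };

template<typename TreeType>
class SearchWrapper : public SearchBase
{
 public:
  std::string TreeName() const override { return TreeType::name; }
};

// The "shim": translate a runtime parameter into a template parameter.
std::unique_ptr<SearchBase> MakeSearch(const std::string& treeType)
{
  if (treeType == "kd")
    return std::make_unique<SearchWrapper<KDTree>>();
  if (treeType == "ball")
    return std::make_unique<SearchWrapper<BallTree>>();
  throw std::invalid_argument("unknown tree type: " + treeType);
}
```

The tedium rcurtin mentions comes from the fact that every runtime-selectable template parameter multiplies the number of branches the shim has to enumerate.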

My best suggestion for achieving what you are looking for is to define a base class along these lines, provide the functions you need (presumably Train() and Predict()), and then make child classes for each individual learner you are looking for. If the interface is very simple, you may be able to use the second suggestion I wrote above to keep the boilerplate very short. Using that example, you could create objects like, e.g., MyModel<LogisticRegression> and MyModel<DecisionTree>, and you would only require the simple definition of MyModel.

For the metadata mapping, I think that would have to be done along the same lines---but I am not sure I understand quite enough about the problem to be able to suggest something specific.

For the ONNX converter, yes, one was written some time ago: https://github.com/sreenikSS/mlpack-Tensorflow-Translator. However, it is not in a completely stable state, and I'm not sure whether it would currently work.


rcurtin commented on June 18, 2024

As far as I know, the CV and HPT parts of mlpack are a little underadvertised and underused, so I think there is some opportunity to improve their APIs and usability. If you have any suggestions or patches that make your life way easier---or if there is some simple and standard use case that CV/HPT should support but doesn't---the feedback would definitely be welcomed. 👍

