cuda.ml's People

Contributors

dfalbel, t-kalinowski, yitao-li

cuda.ml's Issues

move algorithm implementation detail that needs to be compiled with nvcc to separate .cu{,h} files

After this is done, we will accomplish the following:

  • We will no longer need the odd workaround of setting the language to 'CUDA' for numerous .cpp files in src. Currently this is done so that everything in those files is built with nvcc and still works with Rcpp::compileAttributes() (Rcpp expects files containing // [[Rcpp::export(...)]] to be *.c, *.cpp, *.h, *.hpp, etc., not *.cu). After this change, files named *.cpp will be pure C++ source files.
  • Things like __host__ which are CUDA-specific will end up in .cu files which will not interfere with Rcpp::compileAttributes() parsing the corresponding cpp files
  • The cpp files will just say
#if HAS_CUML
<run some implementation function in cuml4r namespace>
#else 
#include "warn_cuml_missing.h"
#endif

This shows much more clearly what HAS_CUML does at compile time, and it is much better than interleaving preprocessor directives with all sorts of other implementation details.
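A minimal sketch of that layout, written so that it compiles as plain C++ when HAS_CUML is undefined or 0. The function names `rand_forest_fit` and `cuml4r::rand_forest_fit_impl` are hypothetical (not the package's actual names), and the `fprintf` call stands in for whatever "warn_cuml_missing.h" really does:

```cpp
#include <cstdio>
#include <vector>

// HAS_CUML is normally set by the build system; default it to 0 here so
// this sketch compiles without cuML or nvcc.
#ifndef HAS_CUML
#define HAS_CUML 0
#endif

#if HAS_CUML
namespace cuml4r {
// Hypothetical declaration. The definition would live in a .cu file
// compiled by nvcc, so no CUDA-specific syntax appears in this file.
int rand_forest_fit_impl(std::vector<double> const& x, int n_trees);
}  // namespace cuml4r
#endif

// The .cpp wrapper, i.e. the part Rcpp::compileAttributes() would parse.
// Returns a status code: the implementation's result, or -1 if cuML
// support was not compiled in.
int rand_forest_fit(std::vector<double> const& x, int n_trees) {
#if HAS_CUML
  return cuml4r::rand_forest_fit_impl(x, n_trees);
#else
  (void)x;
  (void)n_trees;
  std::fprintf(stderr, "cuML is not available; nothing was run.\n");
  return -1;
#endif
}
```

With this split, the only preprocessor logic left in the .cpp file is the single HAS_CUML branch.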

  • Finally, GitHub will correctly show the source code of this project contains a smallish fraction of C++ and a somewhat larger fraction of CUDA. Currently almost everything shows up as C++, which is misleading!

R interface for multi-gpu algorithms

We should explore the algorithms suffixed with '_mg' in libcuml: for example, how to specify which GPUs to run them on, and what performance gain they offer compared to their single-GPU equivalents.

Need package website

Made with pkgdown; it's now easy to automate the whole thing with use_pkgdown_travis()

use other frameworks to process formula inputs

If you have a formula method, there are a lot of types of formulas (and in-line functions) that users can throw at you. Rather than writing internal functions from scratch, I would use either the standard R tools (e.g. model.frame(), model.matrix()) or the hardhat package.

The latter is preferred since it does a better job at some things that the standard R infrastructure has issues with. Also, if you end up using hardhat to solve #77, you get this piece for free.

formally check for CUDA and write a sensible error message

Right now, a user without the proper external libraries would not get errors but would not get results either:

library(cuml4r)

# From ?cuml_rand_forest

# Classification

model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "classification",
  trees = 100
)

predictions <- predict(model, iris)

str(model)
#> List of 5
#>  $ mode          : chr "classification"
#>  $ xptr          : list()
#>  $ formula       :Class 'formula'  language Species ~ .
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>  $ resp_var_cls  : chr "factor"
#>  $ resp_var_attrs:List of 2
#>   ..$ levels: chr [1:3] "setosa" "versicolor" "virginica"
#>   ..$ class : chr "factor"
#>  - attr(*, "class")= chr "cuml_rand_forest"
predictions
#> factor(0)
#> Levels: setosa versicolor virginica

Created on 2021-08-17 by the reprex package (v2.0.0)

They can't tell whether this is a bug or something else.

It looks like the package has a function for checking the installation.

When the package is attached, can you check whether the required libraries are in place and, if they are not, emit an error message that gives users directions on how to fix things?

support for sparse inputs from R

Some cuML algorithms will work better with sparse inputs, and a bit of research is needed to find out (if possible) how to integrate sparse inputs from R with sparse inputs required by the C++ interface of cuML libs.
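As one possible starting point: R's Matrix package stores a dgCMatrix in compressed sparse column (CSC) form, with 0-based slots `p`, `i`, and `x`. If the C++ side turns out to want compressed sparse row (CSR) input (an assumption here, to be confirmed per algorithm), the layout conversion could be sketched on the host as follows. The `Csr` struct and `csc_to_csr` function are illustrative names, not part of cuML or this package:

```cpp
#include <cstddef>
#include <vector>

// Compressed sparse row matrix with 0-based indices.
struct Csr {
  std::vector<int> row_ptr;    // size n_rows + 1
  std::vector<int> col_ind;    // size nnz
  std::vector<double> values;  // size nnz
};

// Convert CSC arrays (the dgCMatrix slots `p`, `i`, `x`) to CSR.
Csr csc_to_csr(int n_rows, std::vector<int> const& p,
               std::vector<int> const& i, std::vector<double> const& x) {
  std::size_t const nnz = x.size();
  int const n_cols = static_cast<int>(p.size()) - 1;
  Csr out;
  out.row_ptr.assign(n_rows + 1, 0);
  out.col_ind.resize(nnz);
  out.values.resize(nnz);

  // Count entries per row, then prefix-sum to get each row's start offset.
  for (std::size_t k = 0; k < nnz; ++k) ++out.row_ptr[i[k] + 1];
  for (int r = 0; r < n_rows; ++r) out.row_ptr[r + 1] += out.row_ptr[r];

  // Scatter each column's entries into the next free slot of their row.
  std::vector<int> next(out.row_ptr.begin(), out.row_ptr.end() - 1);
  for (int j = 0; j < n_cols; ++j) {
    for (int k = p[j]; k < p[j + 1]; ++k) {
      int const dest = next[i[k]]++;
      out.col_ind[dest] = j;
      out.values[dest] = x[k];
    }
  }
  return out;
}
```

Because columns are visited in increasing order, the column indices within each output row come out sorted.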

make 'R CMD INSTALL' work with ccache

Looks like 'hash_dir = false' plus some other configs are required for something like 'R CMD INSTALL', which works on a different tmp src directory each time, to have cache hits from ccache.

reduce api elements that can result in user error.

We'd like users to have fewer places to accidentally do the wrong thing. For example, in the random forest implementation:

  • The user should not have to set the mode (which defaults to classification). Since the outcome data can be either numeric or a factor, it can be used to set the mode internally.
  • Similarly, the splitting criterion has a 1:1 relationship with the mode. The function can set this internally.

I would remove these two arguments and find other potential pain points that users might experience.

Other suggestions in the same spirit:

  • Don't default to 8 streams. Users should opt in to parallel processing. This is especially true for models that might be resampled; users may unwittingly specify 8^2 parallel workers without realizing it.
  • Make the default for max_leaves Inf instead of a nonsensical value of -1. Even if the underlying code wants -1, don't ask users to remember special magic numbers for arguments.
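The max_leaves suggestion could be handled at the boundary between the R interface and the native code. A minimal sketch, assuming the underlying code treats -1 as "no limit" as described above; `to_native_max_leaves` is a hypothetical helper name:

```cpp
#include <cmath>
#include <limits>
#include <stdexcept>

// Translate a user-facing max_leaves value, where positive infinity means
// "no limit", into the -1 sentinel expected by the underlying code.
int to_native_max_leaves(double max_leaves) {
  if (std::isinf(max_leaves) && max_leaves > 0) return -1;
  bool const valid = !std::isnan(max_leaves) && max_leaves >= 1 &&
                     max_leaves == std::floor(max_leaves) &&
                     max_leaves <= std::numeric_limits<int>::max();
  if (!valid) {
    throw std::invalid_argument(
        "max_leaves must be a positive whole number or Inf");
  }
  return static_cast<int>(max_leaves);
}
```

The sentinel then stays an internal detail, and invalid values such as -1 passed by the user are rejected with a readable message.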

Use S3 methods

There are a fair number of standard R tools that can make the user experience much better.

The model functions have x, y, and formula arguments (and x is overloaded with two meanings).

I suggest making S3 methods for data frames, matrices, and formulas (and maybe recipes).

This can be automated using the hardhat package, which takes care of all of the scaffolding.

install RAPIDS cuML automatically

Is it possible for this package to install RAPIDS cuML automatically when the package is installed? The torch R package automatically installs libtorch on my Linux OS.

support "single precision" mode for all ML algorithms

Currently all algorithms implemented so far interpret numeric inputs from R as double-precision floating point numbers, but there may be a performance benefit if single-precision floating point arithmetic is acceptable in terms of accuracy.
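One way to judge whether single precision is acceptable for a given input is to narrow the data and look at the rounding error this introduces. A small host-side sketch; `to_single` and `max_rounding_error` are illustrative helpers, not existing package functions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Narrow double-precision inputs (R's numeric type) to single precision.
std::vector<float> to_single(std::vector<double> const& x) {
  return std::vector<float>(x.begin(), x.end());
}

// Largest absolute difference between the original values and their
// single-precision counterparts.
double max_rounding_error(std::vector<double> const& x) {
  std::vector<float> const xf = to_single(x);
  double worst = 0.0;
  for (std::size_t k = 0; k < x.size(); ++k) {
    worst = std::max(worst, std::fabs(x[k] - static_cast<double>(xf[k])));
  }
  return worst;
}
```

This only measures the error in the inputs; error accumulated during training would still need to be checked per algorithm.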

more documentation on prediction endpoints

There is no standard type argument so it is hard to tell what is returned from the documentation.

I'm interested to know if class probabilities are returned for the models currently in the package.

Package naming

Hi @yitao-li, cool work 🥳. I know you already published on CRAN but I was just wondering if the package should be renamed {cuml} to make it consistent with existing R ML packages from the tidy/mlverse:

  • keras (and not kerasR, which is another R interface)
  • tensorflow
  • torch (and not rTorch, which is another R interface)
  • tabnet
  • ...

Given that this package is not yet widely used or advertised, a name change would be relatively 'cheap'. I know I have no stake in this whatsoever, just thinking out loud. Feel free to close with no comment :-)

multi-class support for SVC

Looks like we would need the R interface to launch training for multiple one-vs-rest classifiers to support classification with N > 2 categories using SVM, because currently the cuML library only supports training an SVM classifier for binary classification tasks.
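The one-vs-rest bookkeeping around the binary classifier can be sketched as below. This assumes class labels are already encoded as integers in [0, n_classes) and that each binary classifier yields a real-valued decision score where larger means "more likely this class". Both helper names are hypothetical, and the binary SVM training itself is left out:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Build one binary (+1 / -1) label vector per class: class c versus the rest.
// Labels must be integers in [0, n_classes).
std::vector<std::vector<int>> one_vs_rest_labels(
    std::vector<int> const& labels, int n_classes) {
  std::vector<std::vector<int>> out(
      n_classes, std::vector<int>(labels.size(), -1));
  for (std::size_t k = 0; k < labels.size(); ++k) out[labels[k]][k] = 1;
  return out;
}

// Given the decision scores of all binary classifiers for one observation
// (at least one score), predict the class whose classifier scored highest.
int predict_class(std::vector<double> const& scores) {
  return static_cast<int>(std::distance(
      scores.begin(), std::max_element(scores.begin(), scores.end())));
}
```

Each of the N label vectors would be passed to a separate binary training run, and prediction combines the N decision scores per observation.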

Need package overview in README

If you need some hints on where to start, usethis::use_readme_rmd() gives the standard tidyverse template. Happy to review a PR if you need a fresh pair of eyes.

write tests

I have been accumulating code that I run to sanity-check that the R interface of {cuml} works correctly in all cases. I will convert this code into testthat test cases (soon-ish) when I have time.
