mlverse / cuda.ml

R interface for cuML

License: Other
After this is done, we will accomplish the following:

Rcpp::compileAttributes() will work as intended, because Rcpp expects files containing // [[Rcpp::export(...)]] annotations to be *.c, *.cpp, *.h, *.hpp, etc., and not *.cu. After this change, files named *.cpp will actually be pure 100% C++ source files.

Constructs such as __host__ which are CUDA-specific will end up in *.cu files, which will not interfere with Rcpp::compileAttributes() parsing the corresponding *.cpp files.

Each exported wrapper then follows the pattern

#if HAS_CUML
<run some implementation function in cuml4r namespace>
#else
#include "warn_cuml_missing.h"
#endif
which shows much more clearly what HAS_CUML
does at compile time, and is much better than interleaving preprocessor directives with all sorts of other implementation details.
We should explore algorithms suffixed with '_mg' in libcuml
-- e.g., how to specify which GPUs to run them on, and what the performance gain will be compared to their single-GPU equivalents
Made with pkgdown - it's now easy to automate the whole thing with use_pkgdown_travis()
This is missing at the moment.
If you have a formula method, there are a lot of types of formulas (and in-line functions) that users can throw at you. Rather than writing internal functions from scratch, I would either use the standard R tools (e.g. model.frame(), model.matrix()) or the hardhat package.
The latter is preferred since it does a better job on some things that the standard R infrastructure has issues with. Also, if you end up using hardhat to solve #77, you get this piece for free.
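A minimal sketch of the base-R route, assuming nothing beyond stats; parse_formula_input() is a hypothetical helper name, not part of this package:

```r
# Hypothetical helper: resolve a user-supplied formula into a numeric
# predictor matrix and a response vector using only base-R tools.
parse_formula_input <- function(formula, data) {
  mf <- model.frame(formula, data)           # handles in-line functions, e.g. log(x)
  x  <- model.matrix(attr(mf, "terms"), mf)  # expands factors into dummy columns
  y  <- model.response(mf)
  list(x = x, y = y)
}

res <- parse_formula_input(Sepal.Length ~ Species + log(Sepal.Width), iris)
```

hardhat's mold()/forge() pair covers the same ground plus edge cases (novel factor levels at predict time, missing columns) that this sketch does not.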
Ideally everything should compile with recent-ish versions of CUDA, but at the moment that's not the case.
Right now, a user without the proper external libraries would not get errors but would not get results either:
# From ?cuml_rand_forest
library(cuml4r)

# Classification
model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "classification",
  trees = 100
)
predictions <- predict(model, iris)
str(model)
#> List of 5
#> $ mode : chr "classification"
#> $ xptr : list()
#> $ formula :Class 'formula' language Species ~ .
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ resp_var_cls : chr "factor"
#> $ resp_var_attrs:List of 2
#> ..$ levels: chr [1:3] "setosa" "versicolor" "virginica"
#> ..$ class : chr "factor"
#> - attr(*, "class")= chr "cuml_rand_forest"
predictions
#> factor(0)
#> Levels: setosa versicolor virginica
Created on 2021-08-17 by the reprex package (v2.0.0)
They can't tell whether this is a bug or something else.
It looks like the package has a function for checking the installation.
Can you check, when the package is attached, whether the proper frameworks are in order and, if they are not, emit an error message that gives directions on how to fix things?
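A hypothetical sketch of such a startup check; has_cuML() here is a stub standing in for the package's own installation-checking function, and the wording of the message is illustrative:

```r
has_cuML <- function() FALSE  # stub: pretend the native library is missing

# Hypothetical check that .onAttach() could call when the package loads
check_cuml_on_attach <- function() {
  if (!has_cuML()) {
    packageStartupMessage(
      "cuML was not found when this package was installed; ",
      "model functions will return empty results. ",
      "Please install RAPIDS cuML, then re-install this package."
    )
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

msg <- capture.output(check_cuml_on_attach(), type = "message")
```

packageStartupMessage() (rather than message() or stop()) is the conventional choice because suppressPackageStartupMessages() can silence it.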
Some cuML algorithms will work better with sparse inputs, and a bit of research is needed to find out (if possible) how to integrate sparse inputs from R with sparse inputs required by the C++ interface of cuML libs.
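On the R side, the de-facto sparse format is Matrix::dgCMatrix (compressed sparse column); its slots map directly onto the CSC arrays a C++ interface would typically consume. Whether cuML's C++ API wants CSC or CSR for a given algorithm is exactly what this research would determine:

```r
library(Matrix)

# A 3x2 sparse matrix with three non-zero entries
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2),
                  x = c(1.5, 2.5, 3.5), dims = c(3, 2))

# The dgCMatrix slots are already zero-based CSC arrays
csc <- list(col_ptr = m@p, row_ind = m@i, values = m@x)
```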
https://developer.nvidia.com/blog/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx -- definitely something to consider
In particular, many (if not all) cuML library routines already do this, so if the cuml4r functions calling those routines also do the same, it would be visually very easy to see the amount of overhead from data pre-processing, data transfer, etc. before we call into the cuML lib.
i.e., without copying data into an STL vector first and then from the STL vector to a pinned host vector
Looks like 'hash_dir = false' plus some other config options are required for something like 'R CMD INSTALL', which works in a different temporary source directory each time, to get cache hits from ccache.
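A minimal ccache.conf sketch along those lines; hash_dir, base_dir, and sloppiness are real ccache options, but whether this particular combination is sufficient for R CMD INSTALL's per-build temporary directories is exactly what needs verifying:

```
hash_dir = false
base_dir = /tmp
sloppiness = include_file_ctime,include_file_mtime
```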
We'd like users to have fewer places to accidentally do the wrong thing. For example, in the random forest implementation:
I would remove these two arguments and find other potential pain points that users might experience.
Other suggestions in the same spirit:

Default max_leaves to Inf instead of a nonsensical value of -1. Even if the underlying code wants -1, don't ask users to remember special magic numbers for arguments.

There is a fair bit of standard R tooling that can make the user experience much, much better.
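A sketch of hiding the magic number at the package boundary: users write Inf, the wrapper translates; as_cuml_max_leaves() is a hypothetical helper name:

```r
# Hypothetical boundary helper: translate a user-friendly Inf default into
# the -1 sentinel the underlying library expects.
as_cuml_max_leaves <- function(max_leaves = Inf) {
  stopifnot(is.numeric(max_leaves), length(max_leaves) == 1L, max_leaves > 0)
  if (is.infinite(max_leaves)) -1L else as.integer(max_leaves)
}
```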
The model functions have x, y, and formula arguments (and x is overloaded with two meanings).
I suggest making S3 methods for data frames, matrices, and formulas (and maybe recipes).
This can be automated using the hardhat package; it takes care of all of the scaffolding.
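A hypothetical sketch of the suggested S3 interface; the function bodies are placeholders, not the package's actual implementation:

```r
cuml_rand_forest <- function(x, ...) UseMethod("cuml_rand_forest")

# data-frame / matrix method: x holds predictors, y the response
cuml_rand_forest.data.frame <- function(x, y, ...) {
  structure(list(n_predictors = ncol(x)), class = "cuml_rand_forest")
}

# formula method: resolve the formula, then delegate to the data.frame method
cuml_rand_forest.formula <- function(x, data, ...) {
  mf <- stats::model.frame(x, data)
  cuml_rand_forest.data.frame(mf[, -1L, drop = FALSE],
                              stats::model.response(mf), ...)
}

fit <- cuml_rand_forest(Species ~ ., iris)
```

With hardhat, the two methods would instead both funnel into hardhat::mold(), which normalizes every input style into one blueprint before the engine code runs.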
Is it possible for this package to install RAPIDS cuML automatically when the package is installed? The torch R package automatically installs libtorch on my Linux OS.
Currently all algorithms implemented so far interpret numeric inputs from R as double-precision floating-point numbers, but there may be a perf benefit if single-precision floating-point arithmetic is acceptable in terms of accuracy.
There is no standard type argument, so it is hard to tell from the documentation what is returned.
I'm interested to know whether class probabilities are returned for the models currently in the package.
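A sketch of the standard `type` argument convention used across R modeling packages; the cuml_rand_forest class and its `levels` field are illustrative stand-ins, and the returned values are placeholders:

```r
predict.cuml_rand_forest <- function(object, new_data,
                                     type = c("class", "prob"), ...) {
  type <- match.arg(type)  # errors clearly on any unrecognized type
  switch(type,
    class = factor(rep(object$levels[1L], nrow(new_data)),
                   levels = object$levels),
    prob  = matrix(1 / length(object$levels),  # uniform placeholder probs
                   nrow = nrow(new_data), ncol = length(object$levels),
                   dimnames = list(NULL, object$levels))
  )
}

dummy <- structure(list(levels = c("setosa", "versicolor", "virginica")),
                   class = "cuml_rand_forest")
p <- predict(dummy, iris, type = "prob")
```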
The current implementation calls fit_predict() to do both. Users should also be able to call predict() separately on new data points.
Forest Inference Library (FIL) allows some tree-based ensemble models to be run on the GPU (see https://github.com/rapidsai/cuml/tree/branch-21.10/python/cuml/fil).
Python examples: https://github.com/rapidsai/cuml/blob/main/notebooks/forest_inference_demo.ipynb
Particular attention should be paid to correctly converting R factors into something that makes sense for cuML.
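One safe mapping, since cuML classifiers typically take numeric class labels while R factors are 1-based: convert to 0-based integer codes and keep the level vector around for decoding predictions. The helper names below are hypothetical:

```r
# Hypothetical helpers for a lossless factor round-trip across the R/C++ boundary
encode_factor <- function(f) {
  list(labels = as.integer(f) - 1L,  # 0-based codes for the GPU side
       levels = levels(f))           # saved for decoding predictions
}
decode_factor <- function(labels, levels) {
  factor(levels[labels + 1L], levels = levels)
}

enc <- encode_factor(iris$Species)
dec <- decode_factor(enc$labels, enc$levels)
```

Keeping the original level order is what makes decoded predictions comparable (e.g. via identical()) to the training response.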
https://gitlab.kitware.com/cmake/cmake/-/issues/22375 may be relevant
It would be nice to be able to interop seamlessly with models created by the Python cuml library.
Hi @yitao-li, cool work! I know you already published on CRAN, but I was just wondering whether the package should be renamed {cuml} to make it consistent with existing R ML packages from the tidy/mlverse:
Given that this package is not yet widely used or advertised, a name change would be relatively 'cheap'. I know I have no stake in this whatsoever, just thinking out loud. Feel free to close with no comment :-)
Thanks to Tomasz (@t-kalinowski) for helping me realize there is this edge case where the build can fail in some weird and confusing way (even though I thought cmake would just take care of everything for me, it looks like it didn't...)
Looks like we would need the R interface to launch training for multiple one-vs-rest classifiers to support classification with N > 2 categories using SVM, because currently the cuML library only supports training SVM classifiers for binary classification tasks.
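A minimal one-vs-rest sketch of that launching logic; fit_binary stands in for the hypothetical binary SVM trainer, and glm() below is used only to make the sketch runnable without cuML:

```r
# Train one binary classifier per class level
fit_one_vs_rest <- function(x, y, fit_binary) {
  stopifnot(is.factor(y))
  fits <- lapply(levels(y), function(lvl) {
    fit_binary(x, as.integer(y == lvl))  # one binary problem per class
  })
  names(fits) <- levels(y)
  fits
}

fits <- fit_one_vs_rest(
  iris[, 1:4], iris$Species,
  fit_binary = function(x, y) {
    # stand-in trainer; a real version would call the binary cuML SVM
    suppressWarnings(glm(y ~ ., data = cbind(x, y = y), family = binomial))
  }
)
```

Prediction would then score a point with all N classifiers and pick the class whose classifier reports the largest decision value.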
i.e., those implemented by cuML that are similar to ones from Scikit-Learn
If you need some hints on where to start, usethis::use_readme_rmd() gives the standard tidyverse template. Happy to review a PR if you need a fresh pair of eyes.
We can use SFINAE to ensure the C++ source code of cuml4r compiles with both versions of the cuML API.
Ideally R users should be able to have libcuml & co installed without going through conda or the laborious and potentially time-consuming process of building everything from source. Looks like we are then left with a third option, which is to distribute pre-built libcuml and other required deps.
Supported types: raft::distance::DistanceType (raft/cpp/include/raft/sparse/distance/distance.cuh)
I have been accumulating code I run to sanity-check that the R interface of {cuml} works correctly in all cases. I will convert such code into testthat test cases (soon-ish) when I have time.
related issue: #37
namely, PCA, tSVD, UMAP, random projection, and TSNE for now
I'd suggest avoiding the first three letters in the package name.