mlverse / cuda.ml

R interface for cuML

License: Other
After this is done, we will accomplish the following:

Rcpp::compileAttributes() will work as intended, because Rcpp expects files containing // [[Rcpp::export(...)]] annotations to be *.c, *.cpp, *.h, *.hpp, etc., and not *.cu. After this change, files named *.cpp will actually be pure 100% C++ source files.

Constructs such as __host__ which are CUDA-specific will end up in *.cu files, which will not interfere with Rcpp::compileAttributes() parsing the corresponding *.cpp files.

Each exported wrapper then follows the pattern

#if HAS_CUML
<run some implementation function in cuml4r namespace>
#else
#include "warn_cuml_missing.h"
#endif
which shows much more clearly what HAS_CUML
does at compile time, and is much better than interleaving preprocessor directives with all sorts of other implementation details.
We should explore algorithms suffixed with '_mg' in libcuml
-- e.g., how to specify which GPUs to run them on, and what the performance gain will be compared to their single-GPU equivalents
Made with pkgdown - it's now easy to automate the whole thing with use_pkgdown_travis()
This is missing at the moment.
If you have a formula method, there are a lot of types of formulas (and in-line functions) that users can throw at you. Rather than writing internal functions from scratch, I would either use the standard R tools (e.g. model.frame(), model.matrix()) or the hardhat package.
The latter is preferred since it does a better job on some things that the standard R infrastructure has issues with. Also, if you end up using hardhat to solve #77, you get this piece for free.
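A minimal sketch of the base-R route, assuming nothing beyond stats; parse_formula_input() is a hypothetical helper name, not part of this package:

```r
# Hypothetical helper: resolve a user-supplied formula into a numeric
# predictor matrix and a response vector using only base-R tools.
parse_formula_input <- function(formula, data) {
  mf <- model.frame(formula, data)           # handles in-line functions, e.g. log(x)
  x  <- model.matrix(attr(mf, "terms"), mf)  # expands factors into dummy columns
  y  <- model.response(mf)
  list(x = x, y = y)
}

res <- parse_formula_input(Sepal.Length ~ Species + log(Sepal.Width), iris)
```

hardhat's mold()/forge() pair covers the same ground plus edge cases (novel factor levels at predict time, missing columns) that this sketch does not.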
Ideally everything should compile with recent-ish versions of CUDA, but at the moment that's not the case.
Right now, a user without the proper external libraries would not get errors but would not get results either:
# From ?cuml_rand_forest
library(cuml4r)

# Classification
model <- cuml_rand_forest(
  iris,
  formula = Species ~ .,
  mode = "classification",
  trees = 100
)
predictions <- predict(model, iris)
str(model)
#> List of 5
#> $ mode : chr "classification"
#> $ xptr : list()
#> $ formula :Class 'formula' language Species ~ .
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ resp_var_cls : chr "factor"
#> $ resp_var_attrs:List of 2
#> ..$ levels: chr [1:3] "setosa" "versicolor" "virginica"
#> ..$ class : chr "factor"
#> - attr(*, "class")= chr "cuml_rand_forest"
predictions
#> factor(0)
#> Levels: setosa versicolor virginica
Created on 2021-08-17 by the reprex package (v2.0.0)
They can't tell whether this is a bug or something else.
It looks like the package has a function for checking the installation.
Can you check, when the package is attached, whether the proper frameworks are in order and, if they are not, emit an error message that gives directions on how to fix things?
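A hypothetical sketch of such a startup check; has_cuML() here is a stub standing in for the package's own installation-checking function, and the wording of the message is illustrative:

```r
has_cuML <- function() FALSE  # stub: pretend the native library is missing

# Hypothetical check that .onAttach() could call when the package loads
check_cuml_on_attach <- function() {
  if (!has_cuML()) {
    packageStartupMessage(
      "cuML was not found when this package was installed; ",
      "model functions will return empty results. ",
      "Please install RAPIDS cuML, then re-install this package."
    )
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

msg <- capture.output(check_cuml_on_attach(), type = "message")
```

packageStartupMessage() (rather than message() or stop()) is the conventional choice because suppressPackageStartupMessages() can silence it.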
Some cuML algorithms will work better with sparse inputs, and a bit of research is needed to find out (if possible) how to integrate sparse inputs from R with sparse inputs required by the C++ interface of cuML libs.
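On the R side, the de-facto sparse format is Matrix::dgCMatrix (compressed sparse column); its slots map directly onto the CSC arrays a C++ interface would typically consume. Whether cuML's C++ API wants CSC or CSR for a given algorithm is exactly what this research would determine:

```r
library(Matrix)

# A 3x2 sparse matrix with three non-zero entries
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2),
                  x = c(1.5, 2.5, 3.5), dims = c(3, 2))

# The dgCMatrix slots are already zero-based CSC arrays
csc <- list(col_ptr = m@p, row_ind = m@i, values = m@x)
```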
https://developer.nvidia.com/blog/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx -- definitely something to consider
In particular, many (if not all) cuML library routines already do this, so if the cuml4r functions calling those routines also do the same, it would be visually very easy to see the amount of overhead from data pre-processing, data transfer, etc. before we call into the cuML lib.
i.e., without copying data into an STL vector first and then from the STL vector to a pinned host vector
Looks like 'hash_dir = false' plus some other config options are required for something like 'R CMD INSTALL', which works in a different temporary source directory each time, to get cache hits from ccache.
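A minimal ccache.conf sketch along those lines; hash_dir, base_dir, and sloppiness are real ccache options, but whether this particular combination is sufficient for R CMD INSTALL's per-build temporary directories is exactly what needs verifying:

```
hash_dir = false
base_dir = /tmp
sloppiness = include_file_ctime,include_file_mtime
```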
We'd like users to have fewer places to accidentally do the wrong thing. For example, in the random forest implementation:
I would remove these two arguments and find other potential pain points that users might experience.
Other suggestions in the same spirit:

Default max_leaves to Inf instead of a nonsensical value of -1. Even if the underlying code wants -1, don't ask users to remember special magic numbers for arguments.

There is a fair bit of standard R tooling that can make the user experience much, much better.
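A sketch of hiding the magic number at the package boundary: users write Inf, the wrapper translates; as_cuml_max_leaves() is a hypothetical helper name:

```r
# Hypothetical boundary helper: translate a user-friendly Inf default into
# the -1 sentinel the underlying library expects.
as_cuml_max_leaves <- function(max_leaves = Inf) {
  stopifnot(is.numeric(max_leaves), length(max_leaves) == 1L, max_leaves > 0)
  if (is.infinite(max_leaves)) -1L else as.integer(max_leaves)
}
```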
The model functions have x, y, and formula arguments (and x is overloaded with two meanings).
I suggest making S3 methods for data frames, matrices, and formulas (and maybe recipes).
This can be automated using the hardhat package; it takes care of all of the scaffolding.
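A hypothetical sketch of the suggested S3 interface; the function bodies are placeholders, not the package's actual implementation:

```r
cuml_rand_forest <- function(x, ...) UseMethod("cuml_rand_forest")

# data-frame / matrix method: x holds predictors, y the response
cuml_rand_forest.data.frame <- function(x, y, ...) {
  structure(list(n_predictors = ncol(x)), class = "cuml_rand_forest")
}

# formula method: resolve the formula, then delegate to the data.frame method
cuml_rand_forest.formula <- function(x, data, ...) {
  mf <- stats::model.frame(x, data)
  cuml_rand_forest.data.frame(mf[, -1L, drop = FALSE],
                              stats::model.response(mf), ...)
}

fit <- cuml_rand_forest(Species ~ ., iris)
```

With hardhat, the two methods would instead both funnel into hardhat::mold(), which normalizes every input style into one blueprint before the engine code runs.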
Is it possible for this package to install RAPIDS cuML automatically when the package is installed? The torch R package automatically installs libtorch on my Linux OS.
Currently all algorithms implemented so far interpret numeric inputs from R as double-precision floating-point numbers, but there may be a perf benefit if single-precision floating-point arithmetic is acceptable in terms of accuracy.
There is no standard type argument, so it is hard to tell from the documentation what is returned.
I'm interested to know whether class probabilities are returned for the models currently in the package.
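A sketch of the standard `type` argument convention used across R modeling packages; the cuml_rand_forest class and its `levels` field are illustrative stand-ins, and the returned values are placeholders:

```r
predict.cuml_rand_forest <- function(object, new_data,
                                     type = c("class", "prob"), ...) {
  type <- match.arg(type)  # errors clearly on any unrecognized type
  switch(type,
    class = factor(rep(object$levels[1L], nrow(new_data)),
                   levels = object$levels),
    prob  = matrix(1 / length(object$levels),  # uniform placeholder probs
                   nrow = nrow(new_data), ncol = length(object$levels),
                   dimnames = list(NULL, object$levels))
  )
}

dummy <- structure(list(levels = c("setosa", "versicolor", "virginica")),
                   class = "cuml_rand_forest")
p <- predict(dummy, iris, type = "prob")
```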
The current implementation calls fit_predict() to do both. Users should also be able to call predict() separately on new data points.
Forest Inference Library (FIL) allows some tree-based ensemble models to be run on the GPU (see https://github.com/rapidsai/cuml/tree/branch-21.10/python/cuml/fil).
Python examples: https://github.com/rapidsai/cuml/blob/main/notebooks/forest_inference_demo.ipynb
Particular attention should be paid to correctly converting R factors into something that makes sense for cuML.
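One safe mapping, since cuML classifiers typically take numeric class labels while R factors are 1-based: convert to 0-based integer codes and keep the level vector around for decoding predictions. The helper names below are hypothetical:

```r
# Hypothetical helpers for a lossless factor round-trip across the R/C++ boundary
encode_factor <- function(f) {
  list(labels = as.integer(f) - 1L,  # 0-based codes for the GPU side
       levels = levels(f))           # saved for decoding predictions
}
decode_factor <- function(labels, levels) {
  factor(levels[labels + 1L], levels = levels)
}

enc <- encode_factor(iris$Species)
dec <- decode_factor(enc$labels, enc$levels)
```

Keeping the original level order is what makes decoded predictions comparable (e.g. via identical()) to the training response.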
https://gitlab.kitware.com/cmake/cmake/-/issues/22375 may be relevant
It would be nice to be able to interop seamlessly with models created by the Python cuml library.
Hi @yitao-li, cool work! I know you already published on CRAN, but I was just wondering whether the package should be renamed {cuml} to make it consistent with existing R ML packages from the tidy/mlverse:
Given that this package is not yet widely used or advertised, a name change would be relatively 'cheap'. I know I have no stake in this whatsoever, just thinking out loud. Feel free to close with no comment :-)
Thanks to Tomasz (@t-kalinowski) for helping me realize there is this edge case where the build can fail in some weird and confusing way (even though I thought cmake would just take care of everything for me, it looks like it didn't...)
Looks like we would need the R interface to launch training for multiple one-vs-rest classifiers to support classification with N > 2 categories using SVM, because currently the cuML library only supports training SVM classifiers for binary classification tasks.
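A minimal one-vs-rest sketch of that launching logic; fit_binary stands in for the hypothetical binary SVM trainer, and glm() below is used only to make the sketch runnable without cuML:

```r
# Train one binary classifier per class level
fit_one_vs_rest <- function(x, y, fit_binary) {
  stopifnot(is.factor(y))
  fits <- lapply(levels(y), function(lvl) {
    fit_binary(x, as.integer(y == lvl))  # one binary problem per class
  })
  names(fits) <- levels(y)
  fits
}

fits <- fit_one_vs_rest(
  iris[, 1:4], iris$Species,
  fit_binary = function(x, y) {
    # stand-in trainer; a real version would call the binary cuML SVM
    suppressWarnings(glm(y ~ ., data = cbind(x, y = y), family = binomial))
  }
)
```

Prediction would then score a point with all N classifiers and pick the class whose classifier reports the largest decision value.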
i.e., those implemented by cuML that are similar to ones from Scikit-Learn
If you need some hints on where to start, usethis::use_readme_rmd() gives the standard tidyverse template. Happy to review a PR if you need a fresh pair of eyes.
We can use SFINAE to ensure the C++ source code of cuml4r compiles with both versions of the cuML API.
Ideally R users should be able to have libcuml & co installed without going through conda or the laborious and potentially time-consuming process of building everything from source. Looks like we are then left with a third option, which is to distribute pre-built libcuml and other required deps.
Supported types: raft::distance::DistanceType (raft/cpp/include/raft/sparse/distance/distance.cuh)
I have been accumulating code I run to sanity-check that the R interface of {cuml} works correctly in all cases. I will convert such code into testthat test cases (soon-ish) when I have time.
related issue: #37
namely, PCA, tSVD, UMAP, random projection, and TSNE for now
I'd suggest avoiding the first three letters in the package name.