cumf / cumf_als Goto Github PK

View Code? Open in Web Editor NEW

172.0 15.0 47.0 375 KB

CUDA Matrix Factorization Library with Alternating Least Square (ALS)

License: Apache License 2.0

Makefile 4.79% Cuda 71.05% C 14.03% Shell 0.96% C++ 5.38% Jupyter Notebook 1.02% Python 2.76%

gpu matrix-factorization als machine machine-learning cuda

cumf_als's Introduction

CuMF: CUDA-Accelerated ALS on multiple GPUs.

What is matrix factorization?

Matrix factorization (MF) factors a sparse rating matrix R (m by n, with N_z non-zero elements) into a m-by-f and a f-by-n matrices, as shown below.

Matrix factorization (MF) is at the core of many popular algorithms, e.g., collaborative filtering, word embedding, and topic model. GPU (graphics processing units) with massive cores and high intra-chip memory bandwidth sheds light on accelerating MF much further when appropriately exploiting its architectural characteristics.

What is cuMF?

CuMF is a CUDA-based matrix factorization library that optimizes alternate least square (ALS) method to solve very large-scale MF. CuMF uses a set of techniques to maximize the performance on single and multiple GPUs. These techniques include smart access of sparse data leveraging GPU memory hierarchy, using data parallelism in conjunction with model parallelism, minimizing the communication overhead among GPUs, and a novel topology-aware parallel reduction scheme.

With only a single machine with four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, compared with the state-of-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem ever reported yet in current literature.

CuMF achieves excellent scalability and performance by innovatively applying the following techniques on GPUs:

(1) On one GPU, MF deals with sparse matrices, which makes it difficult to utilize GPU's compute power. We optimize memory access in ALS by various techniques including reducing discontiguous memory access, retaining hotspot variables in faster memory, and aggressively using registers. By this means cuMF gets closer to the roofline performance of a single GPU.

(2) On multiple GPUs, we add data parallelism to ALS's inherent model parallelism. Data parallelism needs a faster reduction operation among GPUs, leading to (3).

(3) We also develop an innovative topology-aware, parallel reduction method to fully leverage the bandwidth between GPUs. By this means cuMF ensures that multiple GPUs are efficiently utilized simultaneously.

Use cuMF to accelerate Spark ALS

CuMF can be used standalone, or to accelerate the ALS implementation in Spark MLlib.

We modified Spark's ml/recommendation/als.scala (code) to detect GPU and offload the ALS forming and solving to GPUs, while retain shuffling on Spark RDD.

This approach has several advantages. First, existing apps relying on mllib/ALS need no change. Second, we leverage the best of Spark (to scale-out to multiple nodes) and GPU (to scale-up in one node). Check this GitHub project for more details. It is also a part of IBM packages for Apache Spark version 2.

Build

Type:

make clean build

To see debug message, such as the run-time of each step, type:

make clean debug

Input Data

CuMF need training and testing rating matrices in binary format, and in CSR, CSC and COO formats. In ./data/netflix and ./data/ml10M we have already prepared (i)python scripts to download Netflix and Movielens 10M data, and preprocess them, respectively.

For Netflix data, type:

cd ./data/netflix/
python ./prepare_netflix_data.py

Note: this can take 30+ minutes. You can download this file from your brower, extract and put the extracted files in ./data/netflix directly.

For Movielens:

cd ./data/ml10M/
ipython prepare_ml10M_data.py

Note: you will encounter a NaN test RMSE. Please refer to the "Known Issues" Section.

Run

Type ./main you will see the following instructions:

Usage: give M, N, F, NNZ, NNZ_TEST, lambda, X_BATCH, THETA_BATCH and DATA_DIR.

E.g., for netflix data set, use:

./main 17770 480189 100 99072112 1408395 0.048 1 3 ./data/netflix/

E.g., for movielens 10M data set, use:

./main 71567 65133 100 9000048 1000006 0.05 1 1 ./data/ml10M/

E.g., for yahooMusic dataset, use:

./main 1000990 624961 100 252800275 4003960 1.4 6 3 ./data/yahoo/

Prepare the data as instructed in the previous section, before you run.

Note: rank value F has to be a multiple of 10, e.g., 10, 50, 100, 200.

Large-Scale Problems

For Netflix data, you need to adjust the number of batches to solve X (movie features) and Theta (user features). When F is 100, we set X_BATCH and THETA_BATCH to 1 and 3, respectively. Check test_als.sh for the reference settings for different F values.

Note: we checked these settings on Kepler, Maxwell and Pascal GPU cards where there is more than 12 GB RAM. If you have cards with small memory capacity, you need to increase X_BATCH and THETA_BATCH to run more (smaller) batches.

Directory hugewiki contains the code to solve the much larger hugewiki data set. Read Section 4 of our [paper] (http://arxiv.org/abs/1603.03820) for more details.

Performance Optimization

Conjugate Gradient Solver

CuMF offers two solvers:

(1) Direct LU solver provided by cuBLAS (http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-getrfbatched). It requires O(n^3) computation and also the implementation on GPU is slow.

(2) Conjugate gradient method (https://en.wikipedia.org/wiki/Conjugate_gradient_method). We implement our own CG kernel.

You can use the CG instead of the LU solver, by uncomment #define USE_CG in als.cu.

Half Precision (FP16)

The CG solver can use FP16 to store the left-hand square matrix. Since the CG solver is memory-bound, this can further improve performance.

Known Issues

We are trying to improve the usability, stability and performance. Here are some known issues we are working on:

(1) NaN test error. This is because in some datasets such as movielens 10M, there are users or items with no ratings in training set but some ratings in test set. To overcome this, we have defined a flag in als.cu (#define SURPASS_NAN). If SURPASS_NAN is defined, we check NaN in calculating RMSE and ignore the NaN values. Normally #define SURPASS_NAN should be commented out, as the additional check slows down the computation.

(2) Multi GPU support. We have tested on very large data sets such as SparkALS and HugeWiki, on multiple GPUs on one server. We will make our multi GPU support code available soon.

Teams

References

More details can be found at:

Matrix Factorization on GPUs with Memory Optimization and Approximate Computing. ICPP 2018. [arXiv] (https://arxiv.org/abs/1808.03843).
Accelerate Recommender Systems with GPUs. Nvidia ParallelForAll [blog] ( https://devblogs.nvidia.com/parallelforall/accelerate-recommender-systems-with-gpus/).
CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs. Nvidia GTC 2016 talk. ppt, video
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. [HPDC 2016], Kyoto, Japan. (arXiv)

cumf_als's People

Contributors

Stargazers

Watchers

cumf_als's Issues

Illegal memory access when k=100 for Netflix dataset

Hi,
Thanks for sharing the code.
I am using a k80 machine. The code works fine for k=10,20,..40, but not anything more than 60. Did you ever encounter that problem?
[ss@gpu04 CuMF]$ ./main 70 .058 1
F = 70, lambda = .058, THETA_BATCH = 1
*starting loading training and testing sets to host.
***start allocating memory on GPU...
***_start copying memory to GPU...
CUDA Error:
File = als.cu
Line = 844
_Reason = out of memory
////////////////////////////////////////
[ss@gpu04 CuMF]$ ./main 100 .058 1
F = 100, lambda = .058, THETA_BATCH = 1
*starting loading training and testing sets to host.
***start allocating memory on GPU...
***_start copying memory to GPU...
_CUDA failure als.cu:869: 'an illegal memory access was encountered'

Thanks,
Israt

Does CuMF support implicit feedback data?

Unable to run MovieLens10m Dataset

Hi Mr Tan,
We downloaded the movielens dataset from http://grouplens.org/datasets/movielens/10m/
We divideded the dataset into training and test dataset.

Following is the metadata for that.

m,n = 71567, 65133
nnz in train = 9301274

nnz in test = 698780

(Just to inform : we were able to run your code on the netflix dataset successfully.)
We tried running your code for the the above dataset but we are getting the following output:
./main 10 0.1 1
F = 10, lambda = 0.1, THETA_BATCH = 1
**starting loading training and testing sets to host.
***parameters: m: 71567, n: 65133, f: 10, nnz: 9301274
***start allocating memory on GPU...
******start copying memory to GPU...
--------- Train RMSE in iter 0: nan
--------- Test RMSE in iter 0: nan
--------- Train RMSE in iter 1: nan
--------- Test RMSE in iter 1: nan
--------- Train RMSE in iter 2: nan
--------- Test RMSE in iter 2: nan
--------- Train RMSE in iter 3: nan
--------- Test RMSE in iter 3: nan
--------- Train RMSE in iter 4: nan
--------- Test RMSE in iter 4: nan
--------- Train RMSE in iter 5: nan
--------- Test RMSE in iter 5: nan
--------- Train RMSE in iter 6: nan
--------- Test RMSE in iter 6: nan
--------- Train RMSE in iter 7: nan
--------- Test RMSE in iter 7: nan
--------- Train RMSE in iter 8: nan
--------- Test RMSE in iter 8: nan
--------- Train RMSE in iter 9: nan
--------- Test RMSE in iter 9: nan

doALS takes seconds: 3.000 for F= 10

ALS Done.

it cause 'core dump' on GTX950m(4G) with netflex data....

envy@ub1404:~/os_pri/github/CuMF$ tree
.
├── als.cu
├── als.h
├── als.o
├── host_utilities.cpp
├── host_utilities.h
├── host_utilities.o
├── images
│   ├── mf.png
│   └── spark-gpu.png
├── LICENSE
├── main
├── main.cu
├── main.o
├── Makefile
├── netflix_mme
├── netflix_mm.txt
├── print-test-result.sh
├── README.md
├── scripts
│   ├── prepare_input.ipynb
│   ├── R_test_coo.col.bin
│   ├── R_test_coo.data.bin
│   └── R_test_coo.row.bin
├── tensorflow
│   ├── als_tf.cc
│   ├── build_tf_op.sh
│   ├── cumf_as_tensorflow_ops_test.ipynb
│   └── cumf_as_tensorflow_ops_test.py
├── test_als.sh
└── yknote_build_debug_log

3 directories, 27 files
envy@ub1404:/os_pri/github/CuMF$
envy@ub1404:/os_pri/github/CuMF$
envy@ub1404:~/os_pri/github/CuMF$ ./main 100 0.058 3
F = 100, lambda = 0.058, THETA_BATCH = 3
*******starting loading training and testing sets to host.

loading COO...
Unable to open file!
loading CSR...
Unable to open file!
loading CSC...
Unable to open file!
loading COO Row...
Segmentation fault (core dumped)
envy@ub1404:~/os_pri/github/CuMF$

als.cu(205): error: more than one instance of overloaded function "isnan" matches the argument list

Hi.
I am using Linux
I uncommented in als.cu the line #define SURPASS_NAN
and build by
make clean build

But encountered error messages as follows:

rm -f host_utilities.o device_utilities.o als.o main main.o cg.o
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -std=c++11 -Xcompiler -DADD_ -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -lineinfo -o host_utilities.o -c host_utilities.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -std=c++11 -Xcompiler -DADD_ -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -lineinfo -o device_utilities.o -c device_utilities.cu
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -std=c++11 -Xcompiler -DADD_ -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -lineinfo -o cg.o -c cg.cu
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -std=c++11 -Xcompiler -DADD_ -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -lineinfo -o als.o -c als.cu
als.cu(205): error: more than one instance of overloaded function "isnan" matches the argument list:
function "isnan(float)"
function "std::isnan(float)"
argument types are: (float)

als.cu(205): error: more than one instance of overloaded function "isnan" matches the argument list:
function "isnan(float)"
function "std::isnan(float)"
argument types are: (float)

2 errors detected in the compilation of "/tmp/tmpxft_00006c73_00000000-7_als.cpp1.ii".

Could you please help me?
Thanks

Yahoo Music dataset

Hi,
I downloaded the Yahoo music dataset and trying to run CUMF on it. I wrote my own script to sort and load the dataset. But sorting is taking too long and getting killed. Do you happen to have the binary or a efficient script to create the binaries?

Thanks for your time.

Input data format?

It's not very clear what the input data format is.

It seems to me that there are at least several input files used by main.cu:

"./netflix/R_test_coo.data.bin"
"./netflix/R_test_coo.row.bin"
"./netflix/R_test_coo.col.bin"
"./yahoo/yahoo_R_test_coo.data.bin"
"./yahoo/yahoo_R_test_coo.row.bin"
"./yahoo/yahoo_R_test_coo.col.bin"
"./netflix/R_train_csr.data.bin"
"./netflix/R_train_csr.indptr.bin"
"./netflix/R_train_csr.indices.bin"
"./yahoo/yahoo_R_train_csr.data.bin"
"./yahoo/yahoo_R_train_csr.indptr.bin"
"./yahoo/yahoo_R_train_csr.indices.bin"
"./netflix/R_train_csc.data.bin"
"./netflix/R_train_csc.indices.bin"
"./netflix/R_train_csc.indptr.bin"
"./yahoo/yahoo_R_train_csc.data.bin"
"./yahoo/yahoo_R_train_csc.indices.bin"
"./yahoo/yahoo_R_train_csc.indptr.bin"
"./netflix/R_train_coo.row.bin"
"./yahoo/yahoo_R_train_coo.row.bin"

MemoryError

Hi,
We are trying to run your code on our machine as it is.
Our machine has a 16GB ram with free memory around 10GB.
Even then we are facing MemoryError in the following line:
train_j,train_i,train_rating = np.loadtxt(train_data_file,dtype=np.int32,skiprows=3,unpack=True)

Can you give us pointers on why and how we could be resolving this issue?
Thanks

Missing datasets

Hi, I tried creating the dataset as mentioned in prepare_netflix_data.py.
However, the URL 'http://www.select.cs.cmu.edu/code/graphlab/datasets/ seems to be invalid.
Could you please let me know a workaround?

Provide a direct link to prepared netflix dataset

The readme provides this link:

https://ibm.box.com/s/5vmh77up8reodvihiq0ri66jltg9h4uh

Unforunately, it is difficult to fetch this data without invoking a web browser. Can you provide a direct link to the file that would be suitable to pass to wget or curl, for example?

Thanks!

shall we provide a python interface?

py interface may be slower, but easier to use....

sgd?

it runs very fast, great work! just wonder did you try sgd, how much can it be optimized?

Extracting Outputs

How and where do I extract X and Theta after convergence? Are these being written to a file on the host device?

identifier "cusparseScsrmm2" is undefined

When I use the make command to build the project, an 'identifier "cusparseScsrmm2" is undefined' problem occurs. How do I solve it?

all elements in XTHost and ThetaTHost are all Nan

I ran it on ml10M dataset. And #define SURPASS_NAN is used to avoid Nan test error.

But all elements in XTHost and ThetaTHost are all NaN. Could you help me figure it out?

Thanks very much.
Best.

When rank is 70, it does not converge ······

Hi：
CuMF is very efficient. It is a amazing results.But I have three questions, the first is that RMSE does not converge when lambda is 0.05 and rank is 70,this is a very strange situation. The second is that should I process the matrix R into block form and Stored as CSR and CSC format before run the SU-ALS? The last is that can you send me the code of SU-ALS? I am very interested in this algorithm.

Issue: ./als_tf.so: undefined symbol: _ZTIN10tensorflow8OpKernelE

Hi, I am trying to run the TensorFlow example.
I first run the build script, which builds OK and creates als_tf.so and libALS.so in the current directory, but when I execute "cumf_as_tensorflow_ops_test.py" it says:

tensorflow.python.framework.errors.NotFoundError: libALS.so: cannot open shared object file: No such file or directory

I notice that in a previous closed issue " shall we provide a python interface? #1", somebody mentioned this issue, so that I follow his instruction to move the libraries to /usr/lib :

*cp .so /usr/lib

and then running it generates the following error:

File "cumf_as_tensorflow_ops_test.py", line 25, in
als_module = tf.load_op_library(lib_path)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: ./als_tf.so: undefined symbol: _ZTIN10tensorflow8OpKernelE

Can you please give me a clue what went wrong?

Tensorflow interface for cuMF?

Dataset for hugewiki

Thanks for a prompt resolution of the previous query. Could you also point out the url for the hugewiki dataset?