Comments (7)

zeyiwen commented on May 17, 2024

We have fixed the issues with using multi:softprob and n_gpus. If you request more GPUs than are available, you should now see an error message saying that the number of available GPUs is smaller than n_gpus. Please update thundergbm to the latest version. If the problems still exist, feel free to let us know.
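
For example, you could cap n_gpus at the number of GPUs the machine actually exposes before building the classifier (a rough sketch, assuming nvidia-smi is on the PATH; the requested value is just illustrative):

import subprocess

from thundergbm import TGBMClassifier

def visible_gpu_count():
    # `nvidia-smi -L` prints one line per GPU; count the non-empty lines.
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    return len([line for line in out.stdout.splitlines() if line.strip()])

requested_gpus = 2                                  # illustrative
n_gpus = min(requested_gpus, visible_gpu_count())   # never ask for more than the machine has
clf = TGBMClassifier(objective='multi:softprob', n_gpus=n_gpus)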

Regarding the data set size, we cannot reproduce the problem. Would you please provide more information about your data set, or even better, share the data set here directly?

zeyiwen commented on May 17, 2024

Thanks for the feedback. We will work on it and get back to you once the problem is fixed. Please stay tuned.

VoyagerIII commented on May 17, 2024

Thank you very much for the quick reply and for fixing the bug.
The probability values can now be obtained by setting the parameter:

objective='multi:softprob'

However, when I set "n_gpus" to more than 1, ThunderGBM still crashes.
Moreover, even with "n_gpus=1", it crashes with the following error when the training data is larger:

[error == cudaSuccess] out of memory.

Finally, when I del the variable "model", the GPU memory is not released. How can I release it in my code?

Thanks again.

Here is my code and data:

from __future__ import division

import gc
import warnings

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.model_selection import train_test_split

import thundergbm

warnings.filterwarnings('ignore')

# Load the labels and the sparse training matrix, keeping the first 5000 features.
label = pd.read_csv("label.csv", header=None)
csr_trainData = sparse.load_npz('csr_trainData13100.npz')
csr_trainData = csr_trainData[:, :5000]
print(csr_trainData.shape)

# 80/20 train/validation split.
trainData, valData, trainLabel, valLabel = train_test_split(
    csr_trainData, label.iloc[:, 1], test_size=0.2, random_state=0)

clf = thundergbm.TGBMClassifier(
    bagging=1, lambda_tgbm=1, learning_rate=0.07, min_child_weight=1.2,
    n_gpus=1, verbose=0, n_parallel_trees=40, gamma=0.2, depth=7,
    n_trees=4000, tree_method='hist', objective='multi:softprob')

clf.fit(trainData, trainLabel)
print(clf.score(valData, valLabel))

predictions = clf.predict(valData)
print(predictions)

# Deleting the model does not seem to free the GPU memory.
del clf
gc.collect()
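
One workaround I am considering (a rough sketch, not specific to ThunderGBM; the hyperparameters are just placeholders) is to run the GPU work in a child process, so the driver releases everything when that process exits:

import multiprocessing as mp

def train_and_predict(train_X, train_y, val_X, queue):
    # Import inside the child so the CUDA context lives and dies with it.
    from thundergbm import TGBMClassifier
    clf = TGBMClassifier(n_gpus=1, objective='multi:softprob', depth=7, n_trees=100)
    clf.fit(train_X, train_y)
    queue.put(clf.predict(val_X))

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=train_and_predict, args=(trainData, trainLabel, valData, q))
    p.start()
    preds = q.get()   # read the predictions before joining
    p.join()          # GPU memory held by the child is released when it exits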

Label and data:
https://pan.baidu.com/s/1rssIuuL3icYHsNnlWfHWew
extract code:0gux

zeyiwen commented on May 17, 2024

The code runs fine on our machine. What OS, GPUs, and CUDA version do you use?

VoyagerIII commented on May 17, 2024

Ubuntu 18.04
NVIDIA:
NVIDIA-SMI 390.67 Driver Version: 390.67
CUDA:
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Could you check how much of the CUDA memory gets used on your machine when running it?
It performs well at small scale, but breaks with large-scale training data.
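
If it helps to compare, something like this can log the GPU memory while fit() runs (a rough sketch; it assumes nvidia-smi is on the PATH and reuses clf, trainData, and trainLabel from the script above):

import subprocess, threading, time

def log_gpu_memory(stop_event, interval=1.0):
    # Print used/total memory for every GPU roughly once per second.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True)
        print(out.stdout.strip())
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=log_gpu_memory, args=(stop,), daemon=True).start()
clf.fit(trainData, trainLabel)   # clf, trainData, trainLabel as defined above
stop.set()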

fjgmoya commented on May 17, 2024

Thanks for your work, @zeyiwen.

Like @VoyagerIII, I see a similar error. When I execute my code with 1 GPU, there are no problems at all, but if I set n_gpus to 2 or 3, I get an "illegal memory access was encountered" error. My computer does have 3 GPUs.

It seems to occur at predict time: fitting completes successfully. I verified this by stopping the code after fitting and before predicting.

This is the code:

import numpy as np
import sys
from thundergbm import TGBMClassifier
from sklearn import datasets as dts
from sklearn.model_selection import train_test_split

# Overall parameters
train_ratio = 0.75
random_state = 123457
limit = None
num_classes = 10
num_estimators = 10
num_parallel_trees = 100
objective = 'multi:softmax'
max_depth = 6

# Number of GPUs
num_gpus = 3


# Load the digits dataset
digits = dts.load_digits()
X = digits.data
y = digits.target

# Create 0.75/0.25 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=(1 - train_ratio),
    train_size=train_ratio,
    random_state=random_state,
    shuffle=True,
    stratify=None)


# Classifier
clf = TGBMClassifier(
    objective=objective,
    n_trees=num_estimators,
    n_parallel_trees=num_parallel_trees,
    n_gpus=num_gpus,
    depth=max_depth,
    num_class=num_classes,
    tree_method='auto')

# Fitting
clf.fit(X_train, y_train)
# sys.exit(0)

# Predicting
y_pred = clf.predict(X_test)

# Score
print("Score: %10.5f" % (np.count_nonzero(np.equal(y_pred, y_test)) / y_test.shape[0]))

Ubuntu 18.04.4 LTS
NVIDIA-SMI 396.54, 3 TITAN Xp GPUs
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Thanks.

Kurt-Liuhf commented on May 17, 2024

Hi @fjgmoya, the "illegal memory access was encountered" issue when running prediction on multiple GPUs has been fixed. You can reinstall ThunderGBM and try again. Thank you!
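
A quick way to verify it (just a sketch; the parameters are illustrative) is to reinstall, e.g. with pip install --upgrade thundergbm or by rebuilding from source, and rerun the digits example with more than one GPU:

from sklearn import datasets as dts
from sklearn.model_selection import train_test_split
from thundergbm import TGBMClassifier

X, y = dts.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123457)

clf = TGBMClassifier(objective='multi:softmax', num_class=10, n_gpus=2)
clf.fit(X_train, y_train)
print(clf.predict(X_test)[:10])   # should no longer hit "illegal memory access"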
