motefly / deepgbm
SIGKDD'2019: DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks
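If I wanted to implement catboost2nn, would the code changes be large? Is it a matter of replacing the GBDT-related parts with CatBoost? (I'm a student and a beginner, so any guidance would be appreciated.)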
Hi, I am trying to replicate this work on my own dataset, a corpus of about 0.2 million samples, and I trained the GBDT2NN model for a 5-class classification task. I found that the leaf embedding model does not perform well on the test set (accuracy is far lower than GBDT's), and as a result the GBDT2NN model, which targets this pretrained leaf embedding, performs even worse (a 30-40% drop in accuracy).
Since the paper does not present any evaluation of the leaf embedding learning, I wonder if you could clarify a few things:
The paper presents results only for binary classification and regression; the number of trees is much larger for multi-class classification tasks. Does a large tree_number affect performance? Since the initial leaf embedding has shape [n_clusters, max_leaves, num_classes, leaf_emb_size], could a larger number of trees lead to too much variance in the embedding?
Also, if the trees are deeper, leading to very complex tree structures, what adjustments would you recommend when tuning the model for better performance?
For all your datasets, the leaf embedding model trains for at most 10 epochs (many for at most 2). Did all of these models outperform their GBDT counterparts by only learning to predict leaf values? Did they actually converge in so few epochs? Or does more training actually hurt performance (this happens in my experiments)?
I tried to run the Kaggle notebook but got the error KeyError: 'yearbuilt_cate'.
Kaggle MyNotebook - deepgbm-experiments-zillow
It seems that max_ntree_per_split is not properly initialized and is always 0.
Traceback (most recent call last):
  File "main.py", line 102, in <module>
    main()
  File "main.py", line 94, in main
    train_DEEPGBM(args, num_data, cate_data, plot_title, key="")
  File "/Users/binrong/Desktop/Code/DeepGBM/experiments/train_models.py", line 165, in train_DEEPGBM
    emb_model = EmbeddingModel(n_models, max_ntree_per_split, args.embsize, args.maxleaf+1, n_output, group_average, task=args.task).to(device)
  File "/Users/binrong/Desktop/Code/DeepGBM/experiments/models/components.py", line 230, in __init__
    stdv = math.sqrt(1.0 /(max_ntree_per_split))
ZeroDivisionError: float division by zero
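The traceback shows `math.sqrt(1.0 / max_ntree_per_split)` being evaluated with `max_ntree_per_split` equal to 0. The sketch below is one possible guard that fails early with a clearer message. `embedding_init_stdv` is a hypothetical helper, not a function in the repository, and it does not explain why the count is 0 in the first place.

```python
import math


def embedding_init_stdv(max_ntree_per_split: int) -> float:
    """Return sqrt(1 / max_ntree_per_split), rejecting non-positive counts."""
    if max_ntree_per_split <= 0:
        # Hypothetical check: surface the real problem instead of a ZeroDivisionError.
        raise ValueError(
            f"max_ntree_per_split must be positive, got {max_ntree_per_split}; "
            "check how the trees were grouped before building the embedding model"
        )
    return math.sqrt(1.0 / max_ntree_per_split)


stdv = embedding_init_stdv(4)  # 0.5
```

With a guard like this, a value of 0 is reported where it is first used, which makes it easier to trace back to the step that computes it.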
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main()
  File "main.py", line 83, in main
    train_cateModels(args, cate_data, plot_title, key="")
  File "C:\Users\ying\Desktop\DeepGBM\experiments\train_models.py", line 116, in train_cateModels
    opt, args.max_epoch, args.batch_size, 1, key)
  File "C:\Users\ying\Desktop\DeepGBM\experiments\helper.py", line 168, in TrainWithLog
    test_loss, pred_y = EvalTestset(test_x, test_y, model, args.test_batch_size, test_x_opt)
  File "C:\Users\ying\Desktop\DeepGBM\experiments\helper.py", line 89, in EvalTestset
    return sum_loss / test_len, np.concatenate(y_preds, 0)
  File "C:\Users\ying\Miniconda3\lib\site-packages\torch\tensor.py", line 458, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Solution:
```python
# Excerpt from helper.py; math, np, torch and device are defined in that module.
def EvalTestset(test_x, test_y, model, test_batch_size, test_x_opt=None):
    test_len = test_x.shape[0]
    test_num_batch = math.ceil(test_len / test_batch_size)
    sum_loss = 0.0
    y_preds = []
    model.eval()
    with torch.no_grad():
        for jdx in range(test_num_batch):
            tst_st = jdx * test_batch_size
            tst_ed = min(test_len, tst_st + test_batch_size)
            inputs = torch.from_numpy(test_x[tst_st:tst_ed].astype(np.float32)).to(device)
            if test_x_opt is not None:
                inputs_opt = torch.from_numpy(test_x_opt[tst_st:tst_ed].astype(np.float32)).to(device)
                outputs = model(inputs, inputs_opt)
            else:
                outputs = model(inputs)
            targets = torch.from_numpy(test_y[tst_st:tst_ed]).to(device)
            if isinstance(outputs, tuple):
                outputs = outputs[0]
            # Changed line: copy the outputs to host memory before collecting them.
            y_preds.append(outputs.cpu())
            loss_tst = model.true_loss(outputs, targets).item()
            sum_loss += (tst_ed - tst_st) * loss_tst
    return sum_loss / test_len, np.concatenate(y_preds, 0)
```
The only change is y_preds.append(outputs) -> y_preds.append(outputs.cpu()).
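When I ran the binary classification online training and prediction task, I got the following error:
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered
Reportedly this happens when a tensor passed to BCELoss falls outside the range [0, 1]. I looked up many fixes online and modified both outputs in the TrainWithLog function in https://github.com/motefly/DeepGBM/blob/master/experiments/helper.py and out in the true_loss function in https://github.com/motefly/DeepGBM/blob/master/experiments/models/components.py, but the run still fails with the same error. What exactly is going wrong?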
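Binary cross-entropy is only defined for inputs that are probabilities in [0, 1], which is why raw scores or NaN values trip the assert on the GPU. The pure-Python sketch below illustrates the constraint on a single example. `sigmoid` and `binary_cross_entropy` are illustrative helpers written for this note, not functions from the repository or from PyTorch.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw score to a probability in (0, 1), avoiding overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def binary_cross_entropy(prob: float, target: float, eps: float = 1e-12) -> float:
    """BCE for one example; rejects inputs that are not valid probabilities."""
    if math.isnan(prob) or not 0.0 <= prob <= 1.0:
        raise ValueError(f"expected a probability in [0, 1], got {prob}")
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() away from 0
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


raw_score = 2.5  # e.g. a model output that has not been passed through a sigmoid
loss = binary_cross_entropy(sigmoid(raw_score), 1.0)
```

In PyTorch the corresponding options are to apply torch.sigmoid before nn.BCELoss or to use nn.BCEWithLogitsLoss on the raw scores. NaN values in the outputs trigger the same assert, so checking the inputs and the learning rate may also be worthwhile.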
Hi, can you elaborate on the "fast version cateNN" approach? How does it work?
Which file is the test set for zillow? Which one is y_test?
Hello,
I have downloaded the flight dataset from http://stat-computing.org/dataexpo/2009/the-data.html.
The dataset has 29 fields, but only 12 fields are used in your paper. I don't know which specific fields the paper uses; could you explain in detail how to use this dataset?
Thanks.
Hello,
I've noticed that the train_DEEPGBM method returns three variables:
    return deepgbm_model, opt, metric
but the main function does not handle these outputs:
    elif args.model == "deepgbm":
        num_data = dh.load_data(args.data+'_num')
        cate_data = dh.load_data(args.data+'_cate')
        # designed for faster cateNN
        cate_data = dh.trans_cate_data(cate_data)
        train_DEEPGBM(args, num_data, cate_data, plot_title)
Could you tell me how to save the model for inference after training?
I would also like to know how to preprocess the data and how to feed it to the saved model.
Thank you very much!
Can the model output feature importances?
Can SHapley Additive exPlanations (SHAP) be used with the model for interpretability analysis?
I tried to replicate the experiment using the nipsA_deepgbm_offline script, but when I printed the predictions (i.e., pred_y), the predicted values were real numbers, as follows:
pred_y: [0.34316275 0.36292914 0.3552964 0.35970888 0.35918367 0.3515028
0.36127064 0.34485954 0.37243855 0.3597494 0.37358335 0.35961217
0.3446463 0.3496978 0.3566822 0.35322982 0.36830378 0.37577894
0.3774116 0.361733 0.34249693 0.36554664 0.36565682 0.3529494
0.3586624 0.35545474 0.35036924 0.37629476 0.36555907 0.37944567
0.37318298 0.37047318 0.36648336 0.3657822 0.36535558 0.37917492
0.3766928 0.37294027 0.365851 0.36476082 0.3654274 0.37421528
0.35136348 0.3707069 0.37431964 0.38356563 0.35120273 0.37153876
0.3950685 0.3748232 0.36651853 0.35673445 0.36918446 0.36931384
0.36340126 0.3641296 0.38208184 0.3779632 0.36068133 0.34996226
0.34128729 0.3601104 0.35272548 0.34229857 0.35786942 0.352486
0.34367353 0.34292746 0.3950783 0.36609793 0.3616757 0.38065642
Isn't the predicted output supposed to be 0 or 1? Could you please advise?
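The printed values look like scores between 0 and 1 rather than hard labels. Assuming they are predicted probabilities of the positive class (the thread does not confirm this), a minimal sketch of turning them into 0/1 labels with a threshold follows. `to_labels` and both cut-off values are assumptions made for illustration.

```python
def to_labels(probs, threshold=0.5):
    """Convert positive-class probabilities into hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


pred_y = [0.34316275, 0.36292914, 0.3552964]  # first three values from the post

labels_default = to_labels(pred_y)       # [0, 0, 0] with the usual 0.5 cut-off
labels_custom = to_labels(pred_y, 0.36)  # [0, 1, 0] with a lower, hand-picked cut-off
```

Ranking metrics such as AUC are computed on the raw scores and need no threshold, so scores that all sit below 0.5 are not necessarily a sign of a broken model.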
I used your script to generate the data and found that the nipsA offline num folder contains no files, while the other folders do.
Dear author,
I would like to get a copy of the original paper, DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks. Is there anywhere I can get one?
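Hello! I have two questions about the code:
1. In models/deepgbm.py, the if statement on lines 36-40 initializes self.gbdtnn depending on the value of the num_model parameter. In the else branch at line 62 of the forward function, self.gbdtnn is None at that point, so line 63 should raise an error, although this generally does not matter much.
2. The question I really want to ask is this: on line 67 of deepgbm.py, should the != be ==? According to the paper, when num_model == 'gbdt2nn', shouldn't the model output combine the outputs of both parts, gbdt2nn and deepfm? Why does the branch at line 70 ignore the deepfm output?
Thanks!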
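Hello,
While reading your code I noticed a problem: the line self.gain = getItemByTree(self, 'split_gain') should fetch the information gain of each node split, but getFeature inside getItemByTree does not seem to have any corresponding handling for it.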
    def getItemByTree(tree, item='split_feature'):
        root = tree.raw['tree_structure']
        split_nodes = tree.split_nodes
        res = np.zeros(split_nodes+tree.raw['num_leaves'], dtype=np.int32)
        if 'value' in item or 'threshold' in item or 'split_gain' in item:
            res = res.astype(np.float64)
        def getFeature(root, res):
            if 'child' in item:
                if 'split_index' in root:
                    node = root[item]
                    if 'split_index' in node:
                        res[root['split_index']] = node['split_index']
                    else:
                        res[root['split_index']] = node['leaf_index'] + split_nodes # need to check
                else:
                    res[root['leaf_index'] + split_nodes] = -1
            elif 'value' in item:
                if 'split_index' in root:
                    res[root['split_index']] = root['internal_'+item]
                else:
                    res[root['leaf_index'] + split_nodes] = root['leaf_'+item]
            else:
                if 'split_index' in root:
                    res[root['split_index']] = root[item]
                else:
                    res[root['leaf_index'] + split_nodes] = -2
            if 'left_child' in root:
                getFeature(root['left_child'], res)
            if 'right_child' in root:
                getFeature(root['right_child'], res)
        getFeature(root, res)
        return res
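Reading the function above, the string 'split_gain' contains neither 'child' nor 'value', so it appears to fall through to the final else branch, where res[root['split_index']] = root[item] copies the gain of each internal node. The stand-alone sketch below mirrors only that branch, using plain lists. The toy tree is made up in the shape of a LightGBM tree_structure dump, and `collect_split_gain` is a hypothetical name.

```python
def collect_split_gain(root, num_split_nodes, num_leaves):
    """Gather split_gain per internal node; leaf slots get the placeholder -2."""
    res = [0.0] * (num_split_nodes + num_leaves)

    def visit(node):
        if "split_index" in node:  # internal node: copy its gain
            res[node["split_index"]] = node["split_gain"]
            visit(node["left_child"])
            visit(node["right_child"])
        else:  # leaf node: stored after all internal nodes
            res[node["leaf_index"] + num_split_nodes] = -2

    visit(root)
    return res


# Made-up tree with two splits and three leaves.
toy_tree = {
    "split_index": 0, "split_gain": 12.5,
    "left_child": {"leaf_index": 0, "leaf_value": 0.1},
    "right_child": {
        "split_index": 1, "split_gain": 3.25,
        "left_child": {"leaf_index": 1, "leaf_value": -0.2},
        "right_child": {"leaf_index": 2, "leaf_value": 0.4},
    },
}

gains = collect_split_gain(toy_tree, num_split_nodes=2, num_leaves=3)
# gains == [12.5, 3.25, -2, -2, -2]
```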
In preprocess/encoding_cate.py, the line
import category_encoders as ce
fails with:
No module named 'category_encoders'
The embeddings are initialized from a distribution with parameters (0, stdv). Why is the square root of the inverse of the number of features used as the standard deviation?
Line 77 in 8a38af4
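One common rationale for this kind of scaling, not confirmed by the authors here, is that a sum of n independent zero-mean terms, each with variance 1/n, has variance close to 1 regardless of n. The simulation below checks this empirically. It assumes normally distributed values, and `variance_of_sum` is a name made up for this sketch.

```python
import math
import random


def variance_of_sum(n_terms, trials=5000, seed=0):
    """Empirical variance of a sum of n_terms draws from N(0, 1/n_terms)."""
    rng = random.Random(seed)
    stdv = math.sqrt(1.0 / n_terms)  # same form as the stdv in the question
    sums = [
        sum(rng.gauss(0.0, stdv) for _ in range(n_terms))
        for _ in range(trials)
    ]
    mean = sum(sums) / trials
    return sum((s - mean) ** 2 for s in sums) / trials


var_16 = variance_of_sum(16)  # close to 1
var_64 = variance_of_sum(64)  # also close to 1, despite four times as many terms
```

Under these assumptions, the scale of the combined output stays roughly constant as the number of summed terms grows.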
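The test set seems to be used for early stopping when training the GBDT, which effectively makes it a validation set, yet it is also used as the test set when training the full DeepGBM. Shouldn't these be two different datasets?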
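A common way to avoid this kind of overlap is to hold out a separate validation split for early stopping and leave the test split untouched until the final evaluation. The sketch below shows one such split over row indices. The function name and the 80/10/10 ratio are assumptions, not something the repository does.

```python
import random


def three_way_split(n_rows, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle row indices and split them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_valid = int(n_rows * valid_frac)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]  # used only for early stopping
    train_idx = indices[n_test + n_valid:]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = three_way_split(1000)
```

For time-ordered data, such as online prediction logs, splitting by time rather than shuffling may be more appropriate.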
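For work I recently ported the PyTorch version of DeepGBM to TensorFlow, following the source as closely as I could. I tested on one tenth of the Criteo dataset with GPU training. The PyTorch code keeps both GPU and CPU utilization very high, while the TensorFlow version has low CPU and GPU utilization, so with the same parameters and the same GPU the PyTorch version is about 6 times faster. I also noticed TensorFlow-related comments in the source, such as the use of tf.summary(), so I assume the author is familiar with TensorFlow as well. Was there a particular reason for choosing PyTorch over TensorFlow at the time?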
I can't find the data. Could you make it available again?
I'm implementing a version for a multi-class classification task and am not sure where to change the code.
Is it right to change the out_features parameter of the BatchDense part from 1 in the original code to n_classes in my case?