motefly / deepgbm

SIGKDD'2019: DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks

Python 88.91% Shell 11.09%

deepgbm's Issues

catboost2nn

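If I wanted to implement catboost2nn, would the code changes be large? Would it be a matter of replacing the GBDT-related parts with CatBoost? (I am a student and a beginner, so any guidance would be appreciated.)

Hello, I am using the ad click dataset, and the code reports that AA_train1.solution does not exist; the files 'AA_train1.data' and AA_test1 are missing as well.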

Leaf Embedding Model Performance

Hi, I am trying to replicate this work on my own dataset, a corpus of roughly 0.2 million samples, and I trained the GBDT2NN model for a 5-category classification task. I found that the leaf embedding model does not perform well on the test set (its accuracy is far lower than GBDT's), and as a result the GBDT2NN model that targets this pretrained leaf embedding performs even worse (a 30-40% decrease in accuracy).

Since the paper does not present any evaluation of this leaf embedding learning, I wonder if you could clarify a few things:

The paper presents results only for binary classification and regression; the number of trees will be much larger for multi-category classification tasks. Will a large tree_number affect the performance? As the initial leaf embedding size is [n_clusters, max_leaves, num_classes, leaf_emb_size], might a larger number of trees lead to too much variance in the embedding?
Also, if the trees are deeper, leading to very complex tree structures, what adjustments would you recommend for tuning the model for better performance?
For all your datasets, the leaf embedding model trains for at most 10 epochs (many train for at most 2 epochs). Did all of these models outperform their GBDT counterparts by only learning to predict leaf values? Did these models actually converge in so few epochs? Or does more training actually hurt the performance (this happens in my experiments)?

How to preprocess and encode the "Zillow Dataset"?

I tried to run this in a Kaggle notebook but got the error KeyError: 'yearbuilt_cate'.
Kaggle MyNotebook - deepgbm-experiments-zillow

When I use the flight dataset and run main.py, I get the following error

max_ntree_per_split is 0

It seems that max_ntree_per_split is not properly initialized, and it is always 0.

Traceback (most recent call last):
File "main.py", line 102, in <module>
main()
File "main.py", line 94, in main
train_DEEPGBM(args, num_data, cate_data, plot_title, key="")
File "/Users/binrong/Desktop/Code/DeepGBM/experiments/train_models.py", line 165, in train_DEEPGBM
emb_model = EmbeddingModel(n_models, max_ntree_per_split, args.embsize, args.maxleaf+1, n_output, group_average, task=args.task).to(device)
File "/Users/binrong/Desktop/Code/DeepGBM/experiments/models/components.py", line 230, in __init__
stdv = math.sqrt(1.0 /(max_ntree_per_split))
ZeroDivisionError: float division by zero
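
The traceback above shows `math.sqrt(1.0 / max_ntree_per_split)` being evaluated with a value of zero. A minimal sketch of a guard that fails early with a readable message is given below. `init_stdv` is a hypothetical helper, not part of DeepGBM, and it does not explain why the count ends up at zero; it only makes the failure easier to diagnose.

```python
import math


def init_stdv(max_ntree_per_split):
    """Return the init scale from the traceback, rejecting non-positive counts.

    Hypothetical helper for illustration; the name is not from the repository.
    """
    if max_ntree_per_split <= 0:
        raise ValueError(
            "max_ntree_per_split must be positive, got %d; "
            "check how the trees were grouped before building the embedding model"
            % max_ntree_per_split
        )
    return math.sqrt(1.0 / max_ntree_per_split)


print(init_stdv(4))  # 0.5
```

Calling `init_stdv(0)` raises `ValueError` instead of the bare `ZeroDivisionError`, which points at the upstream value rather than at the square root.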

Can someone show an example of how to use DeepGBM with pandas (pd)?

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
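
This error message is raised by PyTorch when `.view()` is called on a tensor whose memory layout is not contiguous, for example after a transpose. The sketch below reproduces it on a small tensor and shows the two usual workarounds, `.reshape(...)` and `.contiguous().view(...)`. The tensors here are illustrative and are not taken from the DeepGBM code.

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.t()  # transposing leaves the data in place, so xt is non-contiguous

try:
    xt.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() returns a view when possible and copies only when it has to
flat_a = xt.reshape(-1)

# contiguous() makes an explicit copy with a compact layout, so view() works
flat_b = xt.contiguous().view(-1)

print(view_failed, flat_a.tolist())
```

In practice, replacing the failing `.view(...)` call with `.reshape(...)` at the line named in the traceback is the smallest change.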

When initializing DeepGBM, I got an error

When initializing the class DeepFM in deepfm.py, I got an error

What is the meaning and purpose of the -nslices parameter?

Add the .cpu() method call for the variable 'outputs'

Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main()
  File "main.py", line 83, in main
    train_cateModels(args, cate_data, plot_title, key="")
  File "C:\Users\ying\Desktop\DeepGBM\experiments\train_models.py", line 116, in train_cateModels
    opt, args.max_epoch, args.batch_size, 1, key)
  File "C:\Users\ying\Desktop\DeepGBM\experiments\helper.py", line 168, in TrainWithLog
    test_loss, pred_y = EvalTestset(test_x, test_y, model, args.test_batch_size, test_x_opt)
  File "C:\Users\ying\Desktop\DeepGBM\experiments\helper.py", line 89, in EvalTestset
    return sum_loss / test_len, np.concatenate(y_preds, 0)
  File "C:\Users\ying\Miniconda3\lib\site-packages\torch\tensor.py", line 458, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Solution:

```python
def EvalTestset(test_x, test_y, model, test_batch_size, test_x_opt=None):
    test_len = test_x.shape[0]
    test_num_batch = math.ceil(test_len / test_batch_size)
    sum_loss = 0.0
    y_preds = []
    model.eval()
    with torch.no_grad():
        for jdx in range(test_num_batch):
            tst_st = jdx * test_batch_size
            tst_ed = min(test_len, tst_st + test_batch_size)
            inputs = torch.from_numpy(test_x[tst_st:tst_ed].astype(np.float32)).to(device)
            if test_x_opt is not None:
                inputs_opt = torch.from_numpy(test_x_opt[tst_st:tst_ed].astype(np.float32)).to(device)
                outputs = model(inputs, inputs_opt)
            else:
                outputs = model(inputs)
            targets = torch.from_numpy(test_y[tst_st:tst_ed]).to(device)
            if isinstance(outputs, tuple):
                outputs = outputs[0]

            # Changed line: copy the outputs to host memory before collecting them
            y_preds.append(outputs.cpu())  # was: y_preds.append(outputs)

            loss_tst = model.true_loss(outputs, targets).item()
            sum_loss += (tst_ed - tst_st) * loss_tst
    return sum_loss / test_len, np.concatenate(y_preds, 0)
```

y_preds.append(outputs) -> y_preds.append(outputs.cpu())
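
The same fix can be written as a small conversion helper, sketched below. `to_numpy` is a hypothetical name, and the batches are stand-ins for model outputs. Calling `.detach()` first is an addition to the fix above: it is harmless inside `torch.no_grad()` and is needed if the tensor still tracks gradients.

```python
import numpy as np
import torch

# Falls back to the CPU when no GPU is available, so the sketch runs anywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_numpy(t):
    """Convert a tensor on any device to a NumPy array."""
    # detach() drops autograd history; cpu() is a no-op for CPU tensors
    return t.detach().cpu().numpy()


# Stand-ins for per-batch model outputs
batches = [torch.ones(2, 1, device=device), torch.zeros(3, 1, device=device)]
y_pred = np.concatenate([to_numpy(b) for b in batches], 0)
print(y_pred.shape)  # (5, 1)
```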

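On online learning and prediction for binary classification

When I run a binary classification online training and prediction task, I get the following error:
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered
It is said that this happens when the tensor passed to BCELoss falls outside the range [0, 1]. I searched online for many fixes, and I modified both outputs in the TrainWithLog function in https://github.com/motefly/DeepGBM/blob/master/experiments/helper.py and out in the true_loss function in https://github.com/motefly/DeepGBM/blob/master/experiments/models/components.py, but the run still reports the same error. What exactly is going wrong?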
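
PyTorch's `BCELoss` expects predictions that are probabilities in [0, 1] and targets in the same range, so both sides are worth checking. The plain-Python sketch below illustrates the idea: validate the ranges first, and map raw scores through a sigmoid before computing the loss. All names are hypothetical and the numbers are made up; it is a debugging aid under those assumptions, not the repository's loss.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def out_of_range(values, lo=0.0, hi=1.0):
    """Return the values outside [lo, hi]; NaN also fails the comparison."""
    return [v for v in values if not (lo <= v <= hi)]


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy with an explicit range check."""
    bad = out_of_range(probs) + out_of_range(targets)
    if bad:
        raise ValueError("values outside [0, 1]: %r" % bad[:5])
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


logits = [2.3, -1.7, 0.0]   # raw model scores (illustrative)
targets = [1.0, 0.0, 1.0]   # labels must also lie in [0, 1]
loss = bce([sigmoid(z) for z in logits], targets)
print(loss)
```

In PyTorch itself, `torch.nn.BCEWithLogitsLoss` accepts raw logits and applies the sigmoid internally. Running the failing step on the CPU usually produces a more specific error message than the device-side assert.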

func trans_cate_data in data_helpers.py for fast version cateNN

Hi, can you elaborate on the "fast version cateNN" approach? How does it work?

Which one is the Zillow test set?

Which one is the Zillow test set? Which one is y_test?

What exactly is feature_sizes?

Dataset About Flight

Hello,

I have downloaded the flight dataset from http://stat-computing.org/dataexpo/2009/the-data.html.
I find that there are 29 fields in this dataset, but only 12 fields are used in your paper. I don't know which specific fields are used in your paper; could you give some detail about how to use this dataset?

Thanks.

Model saving and data prediction

Hello,
I've noticed the train_DEEPGBM method will return three variables,
return deepgbm_model, opt, metric
but the main function didn't handle these outputs,

elif args.model == "deepgbm":
        num_data = dh.load_data(args.data+'_num')
        cate_data = dh.load_data(args.data+'_cate')
        # designed for faster cateNN
        cate_data = dh.trans_cate_data(cate_data)
        train_DEEPGBM(args, num_data, cate_data, plot_title)

Could you tell me how to save the model for inference after training?
And I also want to know how to preprocess the data and how to feed the data to the saved model.

Thank you very much!
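
One possible approach, sketched under assumptions: capture the return values in main (`deepgbm_model, opt, metric = train_DEEPGBM(args, num_data, cate_data, plot_title)`) and save the model's `state_dict` with `torch.save`. The example uses an `nn.Linear` layer as a stand-in for the trained model, and the file name is hypothetical. The preprocessing question is not covered here; inputs at inference time would need the same preprocessing that was applied to the training data.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for the model returned by train_DEEPGBM (assumption for illustration)
model = nn.Linear(4, 1)

# Save only the learned parameters
path = os.path.join(tempfile.mkdtemp(), "deepgbm_model.pt")  # hypothetical file name
torch.save(model.state_dict(), path)

# At inference time: rebuild the model with the same constructor arguments,
# then load the saved parameters into it
restored = nn.Linear(4, 1)
restored.load_state_dict(torch.load(path))
restored.eval()

x = torch.randn(2, 4)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```

Saving the `state_dict` rather than the whole object keeps the checkpoint independent of the file layout of the code, at the cost of having to reconstruct the model before loading.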

Feature importance and SHAP

Can the model output feature importance?
Can the model use SHapley Additive exPlanations (SHAP) for interpretability analysis?

The predicted y output for the nips_A dataset is not binary

I tried to replicate the experiment using the nipsA_deepgbm_offline script, but when I printed the output of the predictions (i.e., pred_y), the predicted values were real numbers, as follows:

pred_y: [0.34316275 0.36292914 0.3552964 0.35970888 0.35918367 0.3515028
0.36127064 0.34485954 0.37243855 0.3597494 0.37358335 0.35961217
0.3446463 0.3496978 0.3566822 0.35322982 0.36830378 0.37577894
0.3774116 0.361733 0.34249693 0.36554664 0.36565682 0.3529494
0.3586624 0.35545474 0.35036924 0.37629476 0.36555907 0.37944567
0.37318298 0.37047318 0.36648336 0.3657822 0.36535558 0.37917492
0.3766928 0.37294027 0.365851 0.36476082 0.3654274 0.37421528
0.35136348 0.3707069 0.37431964 0.38356563 0.35120273 0.37153876
0.3950685 0.3748232 0.36651853 0.35673445 0.36918446 0.36931384
0.36340126 0.3641296 0.38208184 0.3779632 0.36068133 0.34996226
0.34128729 0.3601104 0.35272548 0.34229857 0.35786942 0.352486
0.34367353 0.34292746 0.3950783 0.36609793 0.3616757 0.38065642

Isn't the predicted output supposed to be 0 or 1? Please advise.
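
Values like these look like scores or probabilities rather than class labels, which is common for binary classifiers; that reading is an assumption about this script's output. If hard 0/1 labels are needed, a threshold can be applied afterwards, as in the sketch below. `to_labels` is a hypothetical helper, the first three values are copied from the output above, and the last two are made up to show both classes.

```python
# First three values come from the printed pred_y; the last two are illustrative
pred_y = [0.34316275, 0.36292914, 0.3950685, 0.62, 0.5]


def to_labels(probs, threshold=0.5):
    """Map scores to 0/1 labels; scores at or above the threshold become 1."""
    return [1 if p >= threshold else 0 for p in probs]


labels = to_labels(pred_y)
print(labels)  # [0, 0, 0, 1, 1]
```

Ranking metrics such as AUC are computed on the raw scores, so thresholding is only needed when discrete labels are required. The threshold of 0.5 is a default, not a value taken from the repository.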

FileNotFoundError: [Errno 2] No such file or directory: 'data//nipsA_offline_num/train_features.npy'

I used your script to generate the data and found that there are no files in the nipsA_offline_num folder, but there are files in the other folders.

It seems I cannot find the paper on arXiv; is there any place to find a copy?

Hello, dear author,
I would like to get a copy of the original paper, DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks. Is there any place to get one?

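Two questions about the code

Hello! I have two questions about the code:
1. In models/deepgbm.py, the if statement on lines 36-40 initializes self.gbdtnn according to the value of the num_model parameter. In the else branch on line 62 of the forward function, self.gbdtnn is None at that point, so line 63 should raise an error, although this usually does not matter much.
2. The question I really want to ask is this: on line 67 of deepgbm.py, should the != be ==? According to the paper, when num_model == 'gbdt2nn', shouldn't the model output combine the outputs of both the gbdt2nn and deepfm parts? Why does the branch on line 70 ignore the deepfm output?
Thanks!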

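A question about split_gain

Hello:
While reading your code I noticed an issue. The line self.gain = getItemByTree(self, 'split_gain') should retrieve the gain of each node split, but getFeature inside getItemByTree does not appear to contain a corresponding operation.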
def getItemByTree(tree, item='split_feature'):
    root = tree.raw['tree_structure']
    split_nodes = tree.split_nodes
    res = np.zeros(split_nodes+tree.raw['num_leaves'], dtype=np.int32)
    if 'value' in item or 'threshold' in item or 'split_gain' in item:
        res = res.astype(np.float64)
    def getFeature(root, res):
        if 'child' in item:
            if 'split_index' in root:
                node = root[item]
                if 'split_index' in node:
                    res[root['split_index']] = node['split_index']
                else:
                    res[root['split_index']] = node['leaf_index'] + split_nodes # need to check
            else:
                res[root['leaf_index'] + split_nodes] = -1
        elif 'value' in item:
            if 'split_index' in root:
                res[root['split_index']] = root['internal_'+item]
            else:
                res[root['leaf_index'] + split_nodes] = root['leaf_'+item]
        else:
            if 'split_index' in root:
                res[root['split_index']] = root[item]
            else:
                res[root['leaf_index'] + split_nodes] = -2
        if 'left_child' in root:
            getFeature(root['left_child'], res)
        if 'right_child' in root:
            getFeature(root['right_child'], res)
    getFeature(root, res)
    return res
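
Reading the quoted code, the string 'split_gain' contains neither 'child' nor 'value', so it appears to fall through to the final else branch, where each split node's `root[item]` is copied into `res` and leaves are marked with -2. The sketch below isolates just that branch on a toy tree. `collect_item` is a simplified, hypothetical rewrite that uses a plain list instead of a NumPy array, and the toy tree and its gain values are made up in the shape of a dumped tree structure.

```python
def collect_item(root, item, split_nodes, num_leaves):
    """Simplified version of the generic (final else) branch of getFeature."""
    res = [0.0] * (split_nodes + num_leaves)

    def visit(node):
        if 'split_index' in node:
            # Internal node: read the requested field directly, e.g. 'split_gain'
            res[node['split_index']] = node[item]
        else:
            # Leaf: no such field, so mark it with -2 as in the quoted code
            res[node['leaf_index'] + split_nodes] = -2
        for child in ('left_child', 'right_child'):
            if child in node:
                visit(node[child])

    visit(root)
    return res


# Toy tree with two split nodes and three leaves (values are illustrative)
toy_tree = {
    'split_index': 0, 'split_gain': 12.5,
    'left_child': {'leaf_index': 0, 'leaf_value': 0.1},
    'right_child': {
        'split_index': 1, 'split_gain': 3.0,
        'left_child': {'leaf_index': 1, 'leaf_value': -0.2},
        'right_child': {'leaf_index': 2, 'leaf_value': 0.4},
    },
}

gains = collect_item(toy_tree, 'split_gain', split_nodes=2, num_leaves=3)
print(gains)  # [12.5, 3.0, -2, -2, -2]
```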

Where is "AA_train1.solution"?

Cannot find file

preprocess/encoding_cate.py
When running
import category_encoders as ce
the following error appears:
No module named 'category_encoders'

Why normalize with the sqrt of the inverse of the number of features?

The initial embeddings are drawn with parameters (0, stdv). Why is the square root of the inverse of the number of features used as the standard deviation in

DeepGBM/models/deepfm.py

Line 77 in 8a38af4

stdv = math.sqrt(1.0 /len(self.feature_sizes))
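
One common rationale for this kind of scaling, offered as a general observation and not as the authors' stated reason: if n independent zero-mean terms each have variance 1/n, their sum has variance close to 1 regardless of n, which keeps the scale of a summed output stable as the number of fields grows. The simulation below checks that numerically with the standard library only. `sum_variance` is a hypothetical helper, and the Gaussian draw is an assumption about the initializer.

```python
import math
import random


def sum_variance(n_fields, n_trials=5000, seed=0):
    """Empirical variance of a sum of n_fields draws from N(0, 1/n_fields)."""
    rng = random.Random(seed)
    stdv = math.sqrt(1.0 / n_fields)  # same formula as the quoted line
    sums = [
        sum(rng.gauss(0.0, stdv) for _ in range(n_fields))
        for _ in range(n_trials)
    ]
    mean = sum(sums) / n_trials
    return sum((s - mean) ** 2 for s in sums) / n_trials


for n in (4, 16, 64):
    # Each printed variance should be close to 1, independent of n
    print(n, round(sum_variance(n), 2))
```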

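A question about the test set

The test set seems to be used first for early stopping when training the GBDT, so it can be regarded as a validation set, but it is then used as the test set when training the whole DeepGBM. Shouldn't these two be different datasets?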
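
The usual way to avoid this overlap is a three-way split, where early stopping sees only the validation part and the test part is held out until the end. The sketch below shows the idea on row indices. `three_way_split` and its fractions are hypothetical and are not taken from the repository's scripts.

```python
import random


def three_way_split(n_rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle row indices and split them into train / validation / test."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = idx[:n_test]                # held out for the final evaluation only
    val_idx = idx[n_test:n_test + n_val]   # used for early stopping
    train_idx = idx[n_test + n_val:]       # used for fitting
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

For online or time-ordered data, splitting by time instead of shuffling is usually more appropriate, so that validation and test rows come after the training rows.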

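Where can the datasets be downloaded? The repository has no data folder.

A PyTorch-related question

For work, I recently translated the PyTorch version of DeepGBM into TensorFlow, following the source code as closely as I could. I tested on the Criteo dataset (only one tenth of it) with GPU training and found that the PyTorch source keeps both GPU and CPU utilization very high, whereas the TensorFlow version shows low CPU and GPU utilization. As a result, with the same parameters and the same GPU, PyTorch is about 6 times faster. I also noticed TensorFlow-related comments in the source, such as the use of tf.summary(), so I assume the author is familiar with TensorFlow as well. Was there a particular reason for choosing PyTorch over TensorFlow at the time?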