
Comments (10)

PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

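P.S. I prefer to build training set A because the features in my application are high-dimensional and sparse. Training set A reduces both disk storage and the memory needed to train the model.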

from gbdt.

qiyiping avatar qiyiping commented on May 22, 2024

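Hello,

The training data format follows one convention: if a feature dimension does not appear in a sample, that feature is treated as missing (N/A, not available) for that sample.

Missing values also get special handling when the tree model is fitted.

So in the current data format, a feature whose value is 0 must still be written out explicitly.

As for training complexity, a tree model's training time is proportional to the feature dimension. If the dimension is very high, training will take a very long time, and you could consider trying other models (LR, FM, FFM, etc.).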


PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

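Hello,
1. Because of the requirements of some scenarios, other models may not be a good fit, so I need this C++ gbdt.
2. Datasets A and B carry the same amount of information, so I expected both to give the same AUC. B only fills all of A's missing values with 0; no information is added.
So I would like to ask how the current gbdt handles missing values, and whether there is a recommended approach I could refer to.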


PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

@qiyiping


PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

For example, xgboost's approach is to send all samples with a missing value to the left child node.


qiyiping avatar qiyiping commented on May 22, 2024

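Hello. This implementation handles missing values with a "ternary tree": left and right child nodes plus a NaN node.

That is why the data format has this requirement.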
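The "ternary tree" idea can be illustrated with a small sketch. Everything here is hypothetical and is not the library's actual code: the `Node` layout, the `Predict` function, the `value < threshold` split rule, and the use of NaN as the missing marker (`kMissing`) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical marker for "feature not present in this sample".
// The library's own kUnknownValue may be defined differently.
const float kMissing = std::numeric_limits<float>::quiet_NaN();

// A node with three children: left, right, and one for missing values.
// A node whose `left` is -1 is treated as a leaf.
struct Node {
  int feature = -1;     // index of the feature to split on
  float threshold = 0;  // assumed rule: value < threshold goes left
  int left = -1;
  int right = -1;
  int missing = -1;     // child taken when the feature is missing
  float value = 0;      // prediction stored at a leaf
};

// Walk from the root (index 0) until a leaf is reached.
float Predict(const std::vector<Node>& tree, const std::vector<float>& x) {
  assert(!tree.empty());
  int i = 0;
  while (tree[i].left != -1) {
    const Node& n = tree[i];
    float v = x[n.feature];
    if (std::isnan(v)) {
      i = n.missing;
    } else if (v < n.threshold) {
      i = n.left;
    } else {
      i = n.right;
    }
  }
  return tree[i].value;
}

// Build a single split on feature 0 with three leaves.
std::vector<Node> MakeStump() {
  std::vector<Node> tree(4);
  tree[0].feature = 0;
  tree[0].threshold = 0.5f;
  tree[0].left = 1;
  tree[0].right = 2;
  tree[0].missing = 3;
  tree[1].value = -1.0f;  // leaf for x[0] < 0.5
  tree[2].value = 1.0f;   // leaf for x[0] >= 0.5
  tree[3].value = 0.25f;  // leaf for missing x[0]
  return tree;
}
```

In this sketch an explicit 0 and a missing value reach different leaves, which is why a dataset with absent features and the same dataset with zeros filled in need not produce the same model.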

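If you want to support a sparse data format, you can make a small change to the data loading module:

result->feature[i] = kUnknownValue;

Just change the initial value from kUnknownValue to 0.

Hope this helps.

Thanks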
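A minimal sketch of the change described above, assuming a hypothetical loader that parses space-separated `index:value` tokens. `ParseFeatures`, `kUnknownSketch`, and the token format are illustrative names and assumptions, not taken from the library; the only point is that the initial fill value decides whether an absent feature means "missing" or "0".

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's kUnknownValue; the real
// constant may be defined differently.
const float kUnknownSketch = std::numeric_limits<float>::max();

// Parse "index:value" tokens into a dense vector of length `dim`.
// Features that do not appear in the line keep the value `fill`.
std::vector<float> ParseFeatures(const std::string& line, std::size_t dim,
                                 float fill) {
  std::vector<float> feature(dim, fill);
  std::istringstream in(line);
  std::string token;
  while (in >> token) {
    std::size_t colon = token.find(':');
    if (colon == std::string::npos) continue;  // skip malformed tokens
    std::size_t index = std::stoul(token.substr(0, colon));
    float value = std::stof(token.substr(colon + 1));
    assert(index < dim);
    feature[index] = value;
  }
  return feature;
}
```

Calling `ParseFeatures(line, dim, kUnknownSketch)` corresponds to the current behavior, where absent features are missing, while `ParseFeatures(line, dim, 0.0f)` corresponds to the suggested sparse reading, where absent features are zeros.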


PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

OK. One more question: if it is a ternary tree, why isn't the AUC on training set A equal to 1?


PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

For feature 0, a sample where it is not missing is class 0, and a sample where it is missing is class 1.


qiyiping avatar qiyiping commented on May 22, 2024


PseudoProgrammer avatar PseudoProgrammer commented on May 22, 2024

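Okay, understood, thanks! In that case, if I fill every missing value with the smallest value the machine can represent, the missing values will be assigned to the left child node. Do you think this would cause any problems?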
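A small sketch of the fill-with-smallest-value idea, under the assumption that a split sends `value < threshold` to the left child; that rule, `GoesLeft`, `FillMissing`, and the use of NaN as the missing marker are all hypothetical. One C++ detail worth noting is that `std::numeric_limits<float>::min()` is the smallest positive normal value, so the most negative finite value has to come from `lowest()`.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Most negative finite float. std::numeric_limits<float>::min() would be
// a tiny positive number, not the most negative one.
const float kLowest = std::numeric_limits<float>::lowest();

// Assumed binary split rule: values below the threshold go left.
bool GoesLeft(float value, float threshold) {
  return value < threshold;
}

// Replace every NaN (the assumed missing marker) with kLowest in place.
void FillMissing(std::vector<float>& x) {
  for (float& v : x) {
    if (std::isnan(v)) v = kLowest;
  }
}
```

Under this assumption a filled value goes left at any threshold greater than `kLowest`, so missing values can never be separated from small observed values within a split on the same feature.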

