lambdaji / tf_repos


679.0 44.0 317.0 5.82 MB

TensorFlow Script

Python 90.69% C++ 4.71% Shell 4.61%

deep-learning ctr-prediction tensorflow multi-task-learning pnn deepfm wide-and-deep din

tf_repos's Issues

tf distribute

When I run DeepFM in distributed mode, I get this error:
No worker known as /job:chief/replica:0/task:0
Could you help?
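
For reference, the logs quoted further down this page show a TF_CONFIG value with a "cluster" map (chief, worker, ps) and a "task" entry. The sketch below builds such a value for each task from one shared cluster map, so that every process sees the same chief address. It is only an illustration: the host names and the helper make_tf_config are hypothetical, and a mismatched cluster map is just one possible cause of the error above.

```python
import json
import os

# Hypothetical host:port addresses; replace them with your own machines.
CLUSTER = {
    "chief": ["host0:2222"],
    "worker": ["host1:2222"],
    "ps": ["host2:2222"],
}


def make_tf_config(task_type, task_index):
    """Return a TF_CONFIG JSON string for one task of the cluster."""
    # The evaluator is not listed in the cluster map (see the quoted logs).
    if task_type != "evaluator":
        if task_type not in CLUSTER:
            raise ValueError("unknown task type: %s" % task_type)
        if not 0 <= task_index < len(CLUSTER[task_type]):
            raise ValueError("task index out of range: %d" % task_index)
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })


# Each process sets its own TF_CONFIG before the estimator is created.
os.environ["TF_CONFIG"] = make_tf_config("chief", 0)
```

Every task uses the same CLUSTER dictionary; only the "task" entry differs between processes.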

Can DeepFM train directly on raw training data instead of converting it to libsvm first?

Raw data such as:
1 a,b,c,0.1,0.2,0.3
Something like this?
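
If a conversion step is acceptable, a line like the one above can be turned into libsvm-style "id:val" pairs with a few lines of Python. This is a minimal sketch under stated assumptions, not the repository's preprocessing: the helper raw_to_libsvm, the parameter num_categorical, and the id assignment scheme are all hypothetical. It assumes the first three columns are categorical (encoded with val=1.0) and the rest are continuous (one id per column, val = the number).

```python
def raw_to_libsvm(line, feature_map, num_categorical=3):
    """Convert 'label v1,v2,...' into 'label id:val id:val ...'.

    feature_map is a dict that is filled in place; ids start at 1.
    """
    label, fields = line.strip().split(" ", 1)
    pairs = []
    for field_idx, value in enumerate(fields.split(",")):
        if field_idx < num_categorical:
            # Categorical: one id per (field, value) pair, val fixed to 1.0.
            key, feat_val = (field_idx, value), 1.0
        else:
            # Continuous: one id per field, val is the number itself.
            key, feat_val = (field_idx, None), float(value)
        feat_id = feature_map.setdefault(key, len(feature_map) + 1)
        pairs.append("%d:%s" % (feat_id, feat_val))
    return label + " " + " ".join(pairs)


feature_map = {}
converted = raw_to_libsvm("1 a,b,c,0.1,0.2,0.3", feature_map)
```

The same feature_map has to be reused for the test data, so that a given value maps to the same id in both files.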

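Data processing for the wide_deep model

I have read your work and have one question. Your data-processing logic converts everything into a libsvm format that is "similar to the FM second-order part: embed all features uniformly as <id, val>, with val=1.0 for discrete features". But in your wide&deep model code, input_fn parses CSV directly. Does that mean the wide&deep model does not use the libsvm format?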

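Suggestion: add a README
For example, get_criteo_feature.py assumes by default that every feature in the test set also appears in the training set; otherwise the feature_map is incomplete.
For example, the test data cannot be too small, otherwise nothing survives the cutoff.
For example, the test-set code indexes one position off from the training-set code: val = dists.gen(i, features[continous_features[i] - 1]). I changed it to use the same index as the training set, probably because my test and training data share one format, whereas the author's two files are offset by one column? I also added label = features[0] for the test set, which should make it easier to compare test results later; reusing the last label from the training set felt odd.
For example, a continuous numeric feature cannot have only one unique value, otherwise normalization fails.
...........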
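
The last point in the list can be illustrated with a generic min-max scaler. The repository's own normalization code is not shown on this page, so this is only a sketch of the failure mode and one possible guard; the function name min_max_scale is hypothetical.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Only one unique value: (v - lo) / (hi - lo) would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


constant_column = min_max_scale([5, 5, 5])
normal_column = min_max_scale([0, 5, 10])
```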

an error

DeepFM infer: No such file or directory: '~/deepFM_ex2/data/criteo/pred.txt'
How can I get this file? (I have already run feature.py and model_train.py, and no error was reported.)

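Does PNN skip the factorization for dimensionality reduction?

Hello, I read your code, and it seems the product in IPNN is not factorized. Is that right, or am I misunderstanding something? Thanks.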

Setting the number of hidden units in wide_n_deep.py

hidden_units = map(int, FLAGS.deep_layers.split(","))
should be changed to
hidden_units = list(map(int, FLAGS.deep_layers.split(",")))
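
The reason for the change is that in Python 3, map returns a lazy iterator rather than a list, so code that needs the length of hidden_units or iterates over it more than once breaks. A small stand-alone illustration follows; the string "256,128,64" is a hypothetical value standing in for FLAGS.deep_layers.

```python
deep_layers = "256,128,64"  # hypothetical stand-in for FLAGS.deep_layers

# Python 3: map() yields a one-shot iterator with no len().
lazy_units = map(int, deep_layers.split(","))

# Wrapping it in list() gives a real list that can be indexed and reused.
hidden_units = list(map(int, deep_layers.split(",")))
```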

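Why are the global IDs in the ESMM dataset not unified?

I have a question: why are the IDs of feature 109_14 and feature 206 not unified? In theory, if they refer to the same item category, shouldn't the IDs match? I counted 400,000 records and found no overlap between the IDs that appear in these two features. Have I misunderstood the meaning of this dataset? Can anyone explain? Thanks!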

Why not use feature_columns to build the embedding features? It seems more convenient.

Distributed training

One more question for you.
worker-1 keeps waiting: INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.

worker-0
Part of the log:
INFO:tensorflow:Saving checkpoints for 25076 into /workspace/wlc/model_dir/model.ckpt.
INFO:tensorflow:global_step/sec: 7.5244
E0712 18:25:49.778093 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 84.10749, average_loss = 0.65708977 (31.580 sec)
INFO:tensorflow:loss = 84.10749, step = 25285 (31.580 sec)
E0712 18:26:20.777756 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 81.15384, average_loss = 0.63401437 (25.918 sec)

worker-1 keeps waiting

TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-worker-1-0grc9:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'evaluator', 'index': 0}}
INFO:tensorflow:Using config: {'_num_worker_replicas': 0, '_num_ps_replicas': 0, '_global_id_in_cluster': None, '_master': '', '_save_checkpoints_steps': 1000, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 1000, '_keep_checkpoint_max': 5, '_log_step_count_steps': 1000, '_service': None, '_save_checkpoints_secs': None, '_is_chief': False, '_tf_random_seed': None, '_model_dir': '/workspace/wlc/model_dir/', '_evaluation_master': '', '_task_id': 0, '_cluster_spec': , '_task_type': 'evaluator'}
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999588 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999654 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999693 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999667 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999685 secs before starting next eval run.

worker-2 ran successfully

INFO:tensorflow:loss = 84.003555, average_loss = 0.6562778 (26.016 sec)
INFO:tensorflow:loss = 84.003555, step = 25914 (26.016 sec)
INFO:tensorflow:Loss for final step: 84.82182.
ps_host ['tensorflow-wanglianchen-144-16-ps-0:2222']
worker_host ['tensorflow-wanglianchen-144-16-worker-2:2222']
chief_hosts ['tensorflow-wanglianchen-144-16-worker-0:2222']
{"task": {"index": 0, "type": "worker"}, "cluster": {"ps": ["tensorflow-wanglianchen-144-16-ps-0:2222"], "worker": ["tensorflow-wanglianchen-144-16-worker-2:2222"], "chief": ["tensorflow-wanglianchen-144-16-worker-0:2222"]}}
model_type:wide_deep
train_samples_num:3000000
Parsing /workspace/wlc/wide_deep_dist/data/train.csv
1.0hours
task train success.
modeldir=/workspace/wlc,modelname=model_dir

ps-0 log
start checkWorkerIsFinish
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-ps-0-jrngn:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'ps', 'index': 0}}
INFO:tensorflow:Using config: {'_cluster_spec': , '_task_id': 0, '_model_dir': '/workspace/wlc/model_dir/', '_service': None, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_save_summary_steps': 1000, '_is_chief': False, '_save_checkpoints_secs': None, '_master': 'grpc://tensorflow-wanglianchen-144-16-ps-0:2222', '_global_id_in_cluster': 2, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_checkpoints_steps': 1000, '_task_type': 'ps', '_tf_random_seed': None, '_num_worker_replicas': 2, '_log_step_count_steps': 1000, '_num_ps_replicas': 1, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Start Tensorflow server.
2018-07-12 17:26:33.154403: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-12 17:26:33.160418: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> tensorflow-wanglianchen-144-16-worker-0:2222}
2018-07-12 17:26:33.160444: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-12 17:26:33.160463: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorflow-wanglianchen-144-16-worker-2:2222}
2018-07-12 17:26:33.164749: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222

Saving checkpoints error on a CPU cluster

Why does this problem occur when running distributed TF on a CPU cluster? Is there a fix?
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
could not find method isEncrypted from class org/apache/hadoop/fs/FileStatus with signature ()Z
hdfsGetPathInfo(/user/tdw_gilbertchen/model_path/test/2019080400): getFileInfo error:
java.lang.NoSuchMethodError: isEncrypted
INFO:tensorflow:Graph was finalized.
2019-09-04 11:14:49.486416: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 239ec7870b717670 with config: gpu_options { }
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into hdfs://ss-wxg-3-v2/user/tdw_gilbertchen/model_path/test/2019080400/model.ckpt.
Traceback (most recent call last):
File "/data/user/code/mainRun.py", line 150, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/data/user/code/mainRun.py", line 137, in main
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 637, in run
getattr(self, task_to_run)()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 642, in run_chief
return self._start_distributed_training()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 807, in create_session
hook.after_create_session(self.tf_sess, self.coord)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 568, in after_create_session
self._save(session, global_step)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 599, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1441, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: hdfs://ss-wxg-3-v2/user/tdw_gilbertchen/model_path/test/2019080400/model.ckpt-0_temp_6e34385f8bd846499e635fda38324771/part-00000-of-00001.index; Unknown error 255
[[node save/MergeV2Checkpoints (defined at /data/user/code/mainRun.py:137) = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:0/device:CPU:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S581)]]
[[{{node save/Identity_S583}} = _HostRecvclient_terminated=false, recv_device="/job:chief/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=-6548387880355174373, tensor_name="edge_302_save/Identity", tensor_type=DT_STRING, _device="/job:chief/replica:0/task:0/device:CPU:0"]]

Caused by op u'save/MergeV2Checkpoints', defined at:
File "/data/user/code/mainRun.py", line 150, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/data/user/code/mainRun.py", line 137, in main
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 637, in run
getattr(self, task_to_run)()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 642, in run_chief
return self._start_distributed_training()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
self._scaffold.finalize()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
self._saver.build()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1114, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1151, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 786, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 351, in _AddShardedSaveOpsForV2
sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 473, in merge_v2_checkpoints
delete_old_dirs=delete_old_dirs, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): hdfs://ss-wxg-3-v2/user/tdw_gilbertchen/model_path/test/2019080400/model.ckpt-0_temp_6e34385f8bd846499e635fda38324771/part-00000-of-00001.index; Unknown error 255
[[node save/MergeV2Checkpoints (defined at /data/user/code/mainRun.py:137) = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:0/device:CPU:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S581)]]
[[{{node save/Identity_S583}} = _HostRecvclient_terminated=false, recv_device="/job:chief/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=-6548387880355174373, tensor_name="edge_302_save/Identity", tensor_type=DT_STRING, _device="/job:chief/replica:0/task:0/device:CPU:0"]]

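About the aliccp dataset

A question about the Ali-CCP dataset:
Did you downsample it? After filtering out invalid rows I still have nearly 30 million rows,
and the gz-compressed file is still over 30 GB.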

criteo dataset link does not work

https://github.com/lambdaji/tf_repos/tree/master/deep_ctr#how-to-use

The following link should be updated:

-> The experiments use the criteo dataset; for feature engineering, see here

Could DCN distinguish between continuous features and categorical features?

The current implementation has some problems with the final results.

where to get the criteo dataset

CVR_AUC stays around 0.5 during training

Is the training data too imbalanced? Most of the z labels are 0. Does the data need to be sampled before training?

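DCN model: reshape and embedding_lookup problems caused by parameters such as field_size and feature_size

Can the DCN model code not be used directly on the criteo dataset? Or which parameters need to be passed at run time?
I see that field_size defaults to 0 in the code, so it has to be passed at run time, right? For example, mine is field_size=2496. Without it, "feat_vals = tf.reshape(feat_vals, shape=[-1, field_size, 1])" reshapes to (-1, 0) and fails with:
Reshape cannot infer unless all specified input sizes are non-zero.
But when I do pass it, feature_size also defaults to 0 in the code and is never recomputed, so Feat_Emb ends up with shape (0, 32), which in turn makes
embeddings = tf.nn.embedding_lookup(Feat_Emb, feat_ids) # None * F * K
fail with "indices[0,1872] = 1 is not in [0, 0)".
Then, because each line of tr.libsvm contains 39 spaces and the feature_map file has 428 lines, I changed these two defaults:
tf.app.flags.DEFINE_integer("feature_size", 428, "Number of features")
tf.app.flags.DEFINE_integer("field_size", 39, "Number of fields")
But various dimensions still do not match:
Assign requires shapes of both tensors to match. lhs shape= [1312,1] rhs shape= [79936,1]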
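
One way to avoid guessing the two flags is to derive them from the libsvm file itself. The helper below is a hypothetical sketch, not part of the repository: it treats the number of "id:val" pairs per line as field_size and takes the largest feature id plus one as a safe feature_size, so that every id is a valid row index into the embedding table.

```python
def infer_sizes(lines):
    """Return (sorted distinct pair counts per line, max feature id + 1)."""
    field_sizes = set()
    max_id = -1
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        pairs = parts[1:]  # parts[0] is the label
        field_sizes.add(len(pairs))
        for pair in pairs:
            feat_id = int(pair.split(":", 1)[0])
            max_id = max(max_id, feat_id)
    return sorted(field_sizes), max_id + 1


# Tiny made-up sample; in practice, pass an open file such as tr.libsvm.
sample = ["1 1:0.5 7:1 9:1", "0 2:0.1 7:1 12:1"]
field_sizes, feature_size = infer_sizes(sample)
```

If field_sizes contains more than one value, the rows do not all have the same number of fields and a fixed reshape will not work without padding.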

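Compiling and running the TensorFlow Serving client

Compiling tf_repos/deep_ctr/Serving_pipeline/deep_fm_serving_client.cpp runs into many dependency problems. Could you provide a build script for the client and describe how to run it?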

[Dead link] The experiments use the criteo dataset; for feature engineering, see here

This link is dead. Could you provide another one?

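DeepFM GPU utilization

Hello lambdaji, I have been following your Zhihu posts and tf_repos. I recently used DeepFM to build a ranking model and would like to ask a practical question: do you run into GPU utilization problems in real use? During training my utilization stays below 10%. If a single GPU cannot even be fully utilized, distributed training is pointless. I am using a Tesla P40 with 24 GB of memory, so memory should not be the bottleneck. The data has 81 fields, and the feature index is on the order of millions. The utilization problem has puzzled me for a while, so any advice is appreciated. Thanks!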

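field_size in DeepFM

Hello, I see that field_size is hard-coded in the code. In practice, though, the number of fields per row may vary because the data is sparse, so a direct reshape is not possible. Is there a solution for this?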
feat_ids = features['feat_ids']
feat_ids = tf.reshape(feat_ids,shape=[-1,field_size])
feat_vals = features['feat_vals']
feat_vals = tf.reshape(feat_vals,shape=[-1,field_size])
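
One common workaround, sketched here as an assumption rather than as the repository's approach, is to pad every row to a fixed field_size before it reaches the model. Padding slots get value 0.0, so they contribute nothing to any term that is multiplied by feat_vals. The helper pad_row and the reserved pad_id are hypothetical.

```python
def pad_row(ids, vals, field_size, pad_id=0):
    """Pad or truncate one sparse row to exactly field_size entries."""
    if len(ids) != len(vals):
        raise ValueError("ids and vals must have the same length")
    ids = list(ids[:field_size])
    vals = list(vals[:field_size])
    missing = field_size - len(ids)
    # pad_id is assumed to be an id reserved for padding; its value is 0.0.
    return ids + [pad_id] * missing, vals + [0.0] * missing


padded_ids, padded_vals = pad_row([3, 8], [1.0, 0.5], field_size=4)
```

With rows padded this way, the reshape to [-1, field_size] in the snippet above has a consistent size to work with.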

Where can I get the criteo dataset? Could you share it? The Kaggle dataset link in PaddlePaddle no longer works. Thanks a lot!

Is there any handling for multi-valued categorical features?

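DeepFM performs worse than FNN

Hello, thank you very much for providing the CTR model training code. In my experiments, DeepFM scores lower than FNN by 5 in the third decimal place (about 0.005). In theory, shouldn't it be higher? I would appreciate your opinion. Thanks.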

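DeepFM training keeps printing Parsing ['../../data/criteo/tr.libsvm']

The model keeps printing
Parsing ['../../data/criteo/tr.libsvm']
INFO:tensorflow:Calling model_fn.
What is the reason? Is it because max_steps is not set?

This does not seem reasonable. Even when running multiple epochs, it should not call model_fn and parse the data every time, should it?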

Reshape error during distributed execution

Caused by op u'Reshape', defined at:
File "DeepFM.py", line 392, in <module>
tf.app.run()
File "/data/hadoop/local/usercache/test/appcache/application_5145270655_21212399/container_1569565_99362122/Python/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "DeepFM.py", line 326, in main
tf.estimator.train_and_evaluate(DeepFM, train_spec, eval_spec)
File "DeepFM.py", line 128, in model_fn
feat_ids = tf.reshape(feat_ids, shape=[-1, field_size])

InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: Reshape = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:chief/replica:0/task:0/device:CPU:0"](IteratorGetNext, Reshape/shape)]]
[[Node: gradients/Deep-part/deep_out/MatMul_grad/tuple/control_dependency_1_S313 = _Recvclient_terminated=false, recv_device="/job:ps/replica:0/task:0/device:CPU:0", send_device="/job:chief/replica:0/task:0/device:CPU:0", send_device_incarnation=-1178756093214127197, tensor_name="edge_1006_gradients/Deep-part/deep_out/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:CPU:0"]]

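A small question: why does GPU memory usage not increase when I increase the batch size? batch size = 100000

In principle, a larger batch size should also increase the GPU memory needed for the computation.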

Error when exporting the model for serving

Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key Deep-part/mlp2/biases not found in checkpoint
[[node save/RestoreV2 (defined at Model_pipeline/DeepFM.py:366) ]]

How to use TensorBoard in this framework

W0813 17:48:05.416769 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.417315 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0813 17:48:05.419010 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.419358 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0813 17:48:05.420542 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.421025 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0813 17:48:05.422015 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.422390 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.

These warnings always appear, even though the log dir contains only one event file.

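Cutoff problem for user IDs, item IDs, etc. in the dataset

I have some questions about running the DCN model on the following dataset:
http://labs.criteo.com/2014/02/download-kaggle-display-advertising-challenge-dataset/
Kaggle Display Advertising Challenge Dataset
The data format I see is:
The columns are tab separated with the following schema:
<integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
There is no distinction between user IDs and item IDs, so how can recommendations be made for users? Also, when get_criteo_feature.py processes the data, many categorical values are simply cut off. How can users then be told apart?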
parser.add_argument(
"--cutoff",
type=int,
default=200,
help="cutoff long-tailed categorical values"
)

Thanks!
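
To make the effect of a frequency cutoff concrete, here is a minimal sketch of the general technique. It is not the code of get_criteo_feature.py: the helpers build_vocab and encode are hypothetical, and whether the script keeps values at or strictly above the cutoff is an assumption here (this sketch keeps counts >= cutoff). Rare values are not dropped from the row; they all share one "unknown" id.

```python
from collections import Counter


def build_vocab(values, cutoff):
    """Map each value seen at least `cutoff` times to an id starting at 1."""
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= cutoff)
    return {v: i + 1 for i, v in enumerate(kept)}


def encode(value, vocab):
    """Return the id of a value; long-tail values share id 0."""
    return vocab.get(value, 0)


vocab = build_vocab(["a", "a", "a", "b", "c", "c"], cutoff=2)
rare_id = encode("b", vocab)
```

Under this scheme a value such as a rare user ID still appears as a feature, but it can no longer be distinguished from other rare values.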

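Dimension of the linear term

Looking at the first-order term in DeepFM and
the linear term in PNN, y_linear = tf.reduce_sum(tf.multiply(feat_wgts, feat_vals),1), the output in both cases is a single number rather than a vector. Isn't the first-order term in the papers a vector rather than a number?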
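
For intuition, the reduce_sum over axis 1 in the expression above computes, for each example, the sum of w_i * x_i over its fields, which is the inner product of the weights and the feature values and therefore one scalar per example. The plain-Python sketch below mirrors that arithmetic on made-up numbers; the function name linear_term is hypothetical.

```python
def linear_term(feat_wgts, feat_vals):
    """Per-example sum of weight * value over the field axis."""
    return [
        sum(w * x for w, x in zip(weights, values))
        for weights, values in zip(feat_wgts, feat_vals)
    ]


# One example with three fields (made-up numbers).
y_linear = linear_term([[0.5, -1.0, 2.0]], [[1.0, 1.0, 0.5]])
```

The result has one entry per example in the batch, so a batch of N rows gives a vector of length N, not a vector per row.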

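How to obtain the DIN data from the raw data

I downloaded the dataset, but the data-reading code in the aliccp file does not match the dataset's file names or format, so it cannot be processed. Was the dataset changed?
The criteo dataset contains only readme.txt, train.txt, and test.txt; it has none of the *-* names used in aliccp, and there is no "," delimiter either.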

python & tensorflow versions

Hi, which versions of Python and TensorFlow are you using?

Loss computation in the DeepFM model

Around line 165, when the deep fully connected layers are built, L2 regularization is added to all the variables:
y_deep = tf.contrib.layers.fully_connected(inputs=deep_inputs, num_outputs=1, activation_fn=tf.identity,
weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg), scope='deep_out')
Then, around line 189, the loss function is defined:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=labels)) + \
l2_reg * tf.nn.l2_loss(FM_W) + \
l2_reg * tf.nn.l2_loss(FM_V)

As I understand it, the loss above never collects the regularization terms that weights_regularizer registered earlier,
so it should be changed to
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=labels)) + \
l2_reg * tf.nn.l2_loss(FM_W) + \
l2_reg * tf.nn.l2_loss(FM_V) + \
tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))