tf_repos's People

Contributors

fuhailin, lambdaji

tf_repos's Issues

AFM model: is the attention missing the exponential?

    aij = tf.contrib.layers.fully_connected(inputs=deep_inputs, num_outputs=1, activation_fn=tf.identity, \
        weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg), scope='attention_out')# (None * (F*(F-1))) * 1

    #aij_reshape = tf.reshape(aij, shape=[-1, num_interactions, 1])							# None * (F*(F-1)) * 1
    aij_softmax = tf.nn.softmax(tf.reshape(aij, shape=[-1, num_interactions, 1]), dim=1, name='attention_soft')
    if mode == tf.estimator.ModeKeys.TRAIN:
        aij_softmax = tf.nn.dropout(aij_softmax, keep_prob=dropout[0])

Compared with the original paper, the exponential seems to be missing here. Or have I misunderstood?
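
For reference, softmax exponentiates its inputs before normalizing, so applying `tf.nn.softmax` to the raw scores `aij` already covers the exp-and-normalize step. A minimal pure-Python sketch of that equivalence follows; the function name `softmax` and the sample scores are illustrative and not taken from the repository.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalize so the results sum to 1."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for three feature interactions.
raw_scores = [0.5, -1.0, 2.0]
weights = softmax(raw_scores)

# Explicit exp-and-normalize, for comparison.
denominator = sum(math.exp(t) for t in raw_scores)
explicit = [math.exp(s) / denominator for s in raw_scores]
```

Both lists hold the same values, which is why no separate `exp` call is needed before the softmax.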

tf distribute

When I run DeepFM in distributed mode, an error occurs:
No worker known as /job:chief/replica:0/task:0
Could you help me?
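
One thing that may be worth checking for the error above is whether every process sees the same cluster definition, including a `chief` entry. The sketch below builds a TF_CONFIG value in the shape that appears in the logs elsewhere on this page; the host names and the helper `make_tf_config` are hypothetical.

```python
import json
import os

def make_tf_config(task_type, task_index):
    """Build a TF_CONFIG value; every task should see the same 'cluster' dict."""
    # Hypothetical host:port values, for illustration only.
    cluster = {
        "chief": ["host0:2222"],
        "worker": ["host1:2222"],
        "ps": ["host2:2222"],
    }
    task = {"type": task_type, "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})

# Set this in each process before the Estimator's RunConfig is created.
os.environ["TF_CONFIG"] = make_tf_config("worker", 0)
```

Only the `task` part should differ between processes; the `cluster` part stays identical.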

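Data processing for the wide_deep model

I have read your work and have a question. Your data-processing logic converts everything into libsvm format "similar to the FM second-order part: everything is embedded uniformly as <id, val>, with val=1.0 for discrete features". However, in your wide&deep model code, input_fn parses CSV directly. Does that mean the data there is not processed in libsvm format?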

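libsvm data conversion issues

I suggest adding a README.
For example, get_criteo_feature.py assumes by default that every feature in the test set also appears in the training set; otherwise feature_map is incomplete.
For example, the test data must not be too small, otherwise nothing survives the cutoff.
For example, the test-set code uses an index that is off by one relative to the training-set code: val = dists.gen(i, features[continous_features[i] - 1]). I changed it to use the same index as the training set, presumably because my test and training sets share the same format, whereas the author's two sets have columns offset by one? I also added label = features[0] for the test set, which should make it easier to compare test results later; reusing the last label from the training set felt odd.
For example, a continuous numeric feature must have more than one unique value, otherwise normalization fails.
...........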
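
The last point, a continuous column with a single unique value, can be guarded against as sketched below. This assumes min-max scaling, which may differ from what the script actually does; the function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; return all zeros when the column is constant."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2.0, 4.0, 6.0])
constant = min_max_scale([3.0, 3.0, 3.0])
```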

An error

DeepFM Infer: No such file or directory: '~/deepFM_ex2/data/criteo/pred.txt'
How can I get this file? (I have already run feature.py and model_train.py, and no error was reported.)
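
One possible cause, assuming the path is passed to Python as a literal string, is that the leading `~` is never expanded: the shell expands it, but Python's `open` does not. A minimal sketch of expanding it explicitly (the helper name `resolve_data_path` is hypothetical):

```python
import os

def resolve_data_path(path):
    """Expand a leading '~' to the user's home directory."""
    return os.path.expanduser(path)

# The path from the error message above, expanded before use.
pred_path = resolve_data_path("~/deepFM_ex2/data/criteo/pred.txt")
```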

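Is PNN missing the factorization for dimensionality reduction?

Hello, I read your code and it seems that the IPNN product is not factorized, is that right? Or am I misunderstanding something? Thanks.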

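Why are the global IDs in the ESMM dataset not unified?

I have a question: why are the IDs in feature 109_14 and feature 206 not consistent? In theory, if they refer to the same item category, shouldn't the IDs match? I checked 400,000 rows and found no overlap between the IDs that appear in these two features. Have I misunderstood the meaning of this dataset? Can anyone explain? Thanks!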

Distributed training

One more question for you.
worker-1 keeps waiting:
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.

worker-0
Part of the log:
INFO:tensorflow:Saving checkpoints for 25076 into /workspace/wlc/model_dir/model.ckpt.
INFO:tensorflow:global_step/sec: 7.5244
E0712 18:25:49.778093 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 84.10749, average_loss = 0.65708977 (31.580 sec)
INFO:tensorflow:loss = 84.10749, step = 25285 (31.580 sec)
E0712 18:26:20.777756 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 81.15384, average_loss = 0.63401437 (25.918 sec)

worker-1 keeps waiting

TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-worker-1-0grc9:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'evaluator', 'index': 0}}
INFO:tensorflow:Using config: {'_num_worker_replicas': 0, '_num_ps_replicas': 0, '_global_id_in_cluster': None, '_master': '', '_save_checkpoints_steps': 1000, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 1000, '_keep_checkpoint_max': 5, '_log_step_count_steps': 1000, '_service': None, '_save_checkpoints_secs': None, '_is_chief': False, '_tf_random_seed': None, '_model_dir': '/workspace/wlc/model_dir/', '_evaluation_master': '', '_task_id': 0, '_cluster_spec': , '_task_type': 'evaluator'}
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999588 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999654 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999693 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999667 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999685 secs before starting next eval run.

worker-2 ran successfully

INFO:tensorflow:loss = 84.003555, average_loss = 0.6562778 (26.016 sec)
INFO:tensorflow:loss = 84.003555, step = 25914 (26.016 sec)
INFO:tensorflow:Loss for final step: 84.82182.
ps_host ['tensorflow-wanglianchen-144-16-ps-0:2222']
worker_host ['tensorflow-wanglianchen-144-16-worker-2:2222']
chief_hosts ['tensorflow-wanglianchen-144-16-worker-0:2222']
{"task": {"index": 0, "type": "worker"}, "cluster": {"ps": ["tensorflow-wanglianchen-144-16-ps-0:2222"], "worker": ["tensorflow-wanglianchen-144-16-worker-2:2222"], "chief": ["tensorflow-wanglianchen-144-16-worker-0:2222"]}}
model_type:wide_deep
train_samples_num:3000000
Parsing /workspace/wlc/wide_deep_dist/data/train.csv
1.0hours
task train success.
modeldir=/workspace/wlc,modelname=model_dir

ps-0 log
start checkWorkerIsFinish
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-ps-0-jrngn:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'ps', 'index': 0}}
INFO:tensorflow:Using config: {'_cluster_spec': , '_task_id': 0, '_model_dir': '/workspace/wlc/model_dir/', '_service': None, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_save_summary_steps': 1000, '_is_chief': False, '_save_checkpoints_secs': None, '_master': 'grpc://tensorflow-wanglianchen-144-16-ps-0:2222', '_global_id_in_cluster': 2, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_checkpoints_steps': 1000, '_task_type': 'ps', '_tf_random_seed': None, '_num_worker_replicas': 2, '_log_step_count_steps': 1000, '_num_ps_replicas': 1, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Start Tensorflow server.
2018-07-12 17:26:33.154403: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-12 17:26:33.160418: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> tensorflow-wanglianchen-144-16-worker-0:2222}
2018-07-12 17:26:33.160444: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-12 17:26:33.160463: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorflow-wanglianchen-144-16-worker-2:2222}
2018-07-12 17:26:33.164749: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222

Saving checkpoints error on a CPU cluster

What causes this problem when running distributed TF on a CPU cluster? Is there a fix?
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
could not find method isEncrypted from class org/apache/hadoop/fs/FileStatus with signature ()Z
hdfsGetPathInfo(/user/tdw_gilbertchen/model_path/test/2019080400): getFileInfo error:
java.lang.NoSuchMethodError: isEncrypted
INFO:tensorflow:Graph was finalized.
2019-09-04 11:14:49.486416: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 239ec7870b717670 with config: gpu_options { }
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into hdfs://ss-wxg-3-v2/user/tdw_gilbertchen/model_path/test/2019080400/model.ckpt.
Traceback (most recent call last):
File "/data/user/code/mainRun.py", line 150, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/data/user/code/mainRun.py", line 137, in main
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 637, in run
getattr(self, task_to_run)()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 642, in run_chief
return self._start_distributed_training()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 807, in create_session
hook.after_create_session(self.tf_sess, self.coord)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 568, in after_create_session
self._save(session, global_step)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 599, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1441, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: hdfs://ss-wxg-3-v2/user/tdw_gilbertchen/model_path/test/2019080400/model.ckpt-0_temp_6e34385f8bd846499e635fda38324771/part-00000-of-00001.index; Unknown error 255
[[node save/MergeV2Checkpoints (defined at /data/user/code/mainRun.py:137) = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:0/device:CPU:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S581)]]
[[{{node save/Identity_S583}} = _HostRecv[client_terminated=false, recv_device="/job:chief/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=-6548387880355174373, tensor_name="edge_302_save/Identity", tensor_type=DT_STRING, _device="/job:chief/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'save/MergeV2Checkpoints', defined at:
File "/data/user/code/mainRun.py", line 150, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/data/user/code/mainRun.py", line 137, in main
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 637, in run
getattr(self, task_to_run)()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 642, in run_chief
return self._start_distributed_training()
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/usr/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
self._scaffold.finalize()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
self._saver.build()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1114, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1151, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 786, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 351, in _AddShardedSaveOpsForV2
sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 473, in merge_v2_checkpoints
delete_old_dirs=delete_old_dirs, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): hdfs://ss-wxg-3-v2/user/tdw_gilbertchen/model_path/test/2019080400/model.ckpt-0_temp_6e34385f8bd846499e635fda38324771/part-00000-of-00001.index; Unknown error 255
[[node save/MergeV2Checkpoints (defined at /data/user/code/mainRun.py:137) = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:0/device:CPU:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S581)]]
[[{{node save/Identity_S583}} = _HostRecv[client_terminated=false, recv_device="/job:chief/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=-6548387880355174373, tensor_name="edge_302_save/Identity", tensor_type=DT_STRING, _device="/job:chief/replica:0/task:0/device:CPU:0"]()]]

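About the aliccp dataset

A question about the Ali-CCP dataset:
did you downsample it? After filtering out invalid records I still have nearly 30 million rows,
and even compressed as gz it is over 30 GB.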

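DCN model: reshape and embedding_lookup problems caused by parameters such as field_size and feature_size

  1. Can the DCN model code not be used directly on the criteo dataset? Or which parameters need to be passed at run time?
  2. I see that field_size defaults to 0 in the code, so it presumably has to be passed at run time (for example, field_size=2496 in my case)? Without it, "feat_vals = tf.reshape(feat_vals, shape=[-1, field_size, 1])" reshapes to (-1, 0) and fails:
    Reshape cannot infer unless all specified input sizes are non-zero.
    But even when I pass it, feature_size also defaults to 0 in the code and is never recomputed, so Feat_Emb ends up with shape (0, 32), which then causes
    embeddings = tf.nn.embedding_lookup(Feat_Emb, feat_ids) # None * F * K
    to fail with "indices[0,1872] = 1 is not in [0, 0)".
  3. Then, since each line of tr.libsvm contains 39 spaces and the feature_map file has 428 lines, I changed the following two defaults:
    tf.app.flags.DEFINE_integer("feature_size", 428, "Number of features")
    tf.app.flags.DEFINE_integer("field_size", 39, "Number of fields")
    but various dimensions still do not match:
    Assign requires shapes of both tensors to match. lhs shape= [1312,1] rhs shape= [79936,1]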
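
Both sizes can be read off the libsvm file itself rather than guessed. The sketch below assumes rows of the form `label id:val id:val ...`; the helper `libsvm_stats` and the sample rows are hypothetical. An `Assign requires shapes of both tensors to match` error typically appears when a checkpoint saved with different variable shapes is restored, so a leftover model directory from an earlier run may also be involved.

```python
def libsvm_stats(lines):
    """Return (largest feature id, sorted distinct pair counts per row)."""
    max_id = 0
    field_counts = set()
    for line in lines:
        tokens = line.split()
        pairs = tokens[1:]  # the first token is the label
        field_counts.add(len(pairs))
        for pair in pairs:
            feat_id = int(pair.split(":")[0])
            max_id = max(max_id, feat_id)
    return max_id, sorted(field_counts)

# Two made-up rows with three id:val pairs each.
sample = ["1 1:0.5 2:0.1 7:1", "0 1:0.2 2:0.3 9:1"]
max_id, fields = libsvm_stats(sample)
```

If `fields` contains a single number, that number is a candidate for field_size; feature_size has to be larger than `max_id` so that every id is a valid row of the embedding table.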

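DeepFM GPU utilization problem

Hi lambdaji, I have been following your Zhihu posts and tf_repos. I recently built a ranking model with DeepFM and would like to ask a practical question: do you run into GPU utilization problems in practice? During training my utilization stays below 10%. If a single GPU cannot even be fully utilized, distributed training is pointless. I am using a Tesla P40 with 24 GB of memory, so memory should not be the bottleneck. The data has 81 fields, and the feature index is on the order of millions. I have not been able to figure out the utilization problem and would appreciate any advice. Thanks!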

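field_size in deepFM

Hello, I see that field_size is hard-coded in the code. In practice, though, the number of fields per row may vary because the data is sparse, so a direct reshape is not possible. Is there a solution for this?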
feat_ids = features['feat_ids']
feat_ids = tf.reshape(feat_ids,shape=[-1,field_size])
feat_vals = features['feat_vals']
feat_vals = tf.reshape(feat_vals,shape=[-1,field_size])
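
One common workaround is to pad every row to a fixed field_size during preprocessing, so the reshape above keeps working. The sketch below assumes the model multiplies each embedding by its feat_vals entry, so padded positions with value 0.0 contribute nothing; the helper `pad_row` and the reserved `pad_id` are hypothetical.

```python
def pad_row(ids, vals, field_size, pad_id=0):
    """Pad or truncate (ids, vals) to exactly field_size entries."""
    ids = list(ids)[:field_size]
    vals = list(vals)[:field_size]
    missing = field_size - len(ids)
    # Padding entries carry value 0.0 so they vanish once embeddings
    # are multiplied by feat_vals.
    return ids + [pad_id] * missing, vals + [0.0] * missing

ids, vals = pad_row([5, 9], [1.0, 0.3], field_size=4)
```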

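DeepFM performs worse than FNN

Hello, thank you very much for providing the CTR model training code. In my experiments, DeepFM is about 5 points lower than FNN at the third decimal place. In theory, shouldn't it be higher? I would appreciate your thoughts. Thanks.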

Reshape error during distributed execution

Caused by op u'Reshape', defined at:
File "DeepFM.py", line 392, in <module>
tf.app.run()
File "/data/hadoop/local/usercache/test/appcache/application_5145270655_21212399/container_1569565_99362122/Python/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "DeepFM.py", line 326, in main
tf.estimator.train_and_evaluate(DeepFM, train_spec, eval_spec)
File "DeepFM.py", line 128, in model_fn
feat_ids = tf.reshape(feat_ids, shape=[-1, field_size])

InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: Reshape = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:chief/replica:0/task:0/device:CPU:0"](IteratorGetNext, Reshape/shape)]]
[[Node: gradients/Deep-part/deep_out/MatMul_grad/tuple/control_dependency_1_S313 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:CPU:0", send_device="/job:chief/replica:0/task:0/device:CPU:0", send_device_incarnation=-1178756093214127197, tensor_name="edge_1006_gradients/Deep-part/deep_out/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:CPU:0"]()]]

Error when exporting the model for serving

Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key Deep-part/mlp2/biases not found in checkpoint
[[node save/RestoreV2 (defined at Model_pipeline/DeepFM.py:366) ]]

How to use TensorBoard with this framework

W0813 17:48:05.416769 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.417315 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0813 17:48:05.419010 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.419358 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0813 17:48:05.420542 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.421025 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
W0813 17:48:05.422015 Reloader plugin_event_accumulator.py:303] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0813 17:48:05.422390 Reloader plugin_event_accumulator.py:311] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.

These warnings keep appearing even though the log dir contains only one event file.

Cutoff issue for user IDs, item IDs, etc. in the dataset

I have some questions about running the DCN model on the following dataset:
http://labs.criteo.com/2014/02/download-kaggle-display-advertising-challenge-dataset/
Kaggle Display Advertising Challenge Dataset
The data format is described as:
The columns are tab separated with the following schema:
<integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
It does not distinguish user IDs from item IDs, so how can recommendations be made for users? Also, when get_criteo_feature.py processes the data, many categorical values are simply cut off. How can users then be told apart?
parser.add_argument(
    "--cutoff",
    type=int,
    default=200,
    help="cutoff long-tailed categorical values"
)

Thanks!
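
To illustrate what a frequency cutoff on long-tailed categorical values usually means, here is a sketch that assumes rare values are mapped to one shared `<unk>` id rather than dropped. It is not the script's own code; `build_vocab` and `encode` are hypothetical names.

```python
from collections import Counter

def build_vocab(values, cutoff):
    """Give an id to each value seen at least `cutoff` times; id 0 is '<unk>'."""
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= cutoff)
    vocab = {"<unk>": 0}
    for v in kept:
        vocab[v] = len(vocab)
    return vocab

def encode(value, vocab):
    # Rare or unseen values fall back to the shared '<unk>' id.
    return vocab.get(value, vocab["<unk>"])

vocab = build_vocab(["a", "a", "a", "b", "c", "c"], cutoff=2)
```

Under this scheme all rare values share a single id, so the model cannot tell them apart individually.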

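Dimension of the linear term

In the first-order term of DeepFM
and the linear term of PNN, y_linear = tf.reduce_sum(tf.multiply(feat_wgts, feat_vals),1), the output is a single number rather than a vector. In the papers, isn't the first-order term a vector rather than a number?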
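
The quoted `reduce_sum` over axis 1 collapses the field axis, so each sample ends up with one scalar: the inner product of its weights and values. A pure-Python sketch of the same arithmetic, with a made-up function name and numbers:

```python
def linear_term(feat_wgts, feat_vals):
    """Per-sample sum of w_i * x_i over the field axis (axis 1)."""
    return [sum(w * x for w, x in zip(ws, xs))
            for ws, xs in zip(feat_wgts, feat_vals)]

# Two samples with three fields each (values are made up).
y_linear = linear_term([[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]],
                       [[1.0, 1.0, 1.0], [2.0, 5.0, 1.0]])
```

The result has one entry per sample, matching a tensor of shape [batch_size].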

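How to obtain the DIN data from the raw data

I downloaded the dataset, but the data-reading code in the aliccp files does not match the dataset's file naming or format, so I cannot process it. Was the dataset changed?
The criteo dataset contains only readme.txt, train.txt and test.txt; it has none of the *-* names used in aliccp, and no "," separators either.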

Loss computation in the DeepFM model

Around line 165, when building the deep fully connected layers, L2 regularization is attached to the variables:
y_deep = tf.contrib.layers.fully_connected(inputs=deep_inputs, num_outputs=1, activation_fn=tf.identity,
    weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg), scope='deep_out')
Then, around line 189, the loss function is defined:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=labels)) + \
    l2_reg * tf.nn.l2_loss(FM_W) + \
    l2_reg * tf.nn.l2_loss(FM_V)

As I understand it, this loss does not collect the regularization terms registered earlier through weights_regularizer,
so it should be changed to:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=labels)) + \
    l2_reg * tf.nn.l2_loss(FM_W) + \
    l2_reg * tf.nn.l2_loss(FM_V) + \
    tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

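Question about the loss computation

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=labels))
When the loss is computed here, the prediction passed in is y, which in the code is an output that has not gone through the sigmoid activation. I would like to ask why, since classification tasks usually apply a sigmoid to produce predictions.
PS: What puzzles me is that in my experiments the network optimizes better without the sigmoid, and much worse once the sigmoid is applied.
Thanks!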
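
As the name suggests, `sigmoid_cross_entropy_with_logits` expects raw logits and applies the sigmoid internally, so feeding it an already-squashed value would apply the sigmoid twice. The TensorFlow documentation gives the numerically stable form `max(x, 0) - x * z + log(1 + exp(-abs(x)))`. The pure-Python sketch below checks that form against the sigmoid-then-log-loss computation; the function names mirror the TensorFlow op but are local helpers, and the inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy_with_logits(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def naive_cross_entropy(logit, label):
    """Apply the sigmoid first, then the usual log loss."""
    p = sigmoid(logit)
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

stable = sigmoid_cross_entropy_with_logits(1.5, 1.0)
naive = naive_cross_entropy(1.5, 1.0)
```

Both functions return the same loss, which is why the raw output y is the right input at training time, with the sigmoid applied separately only to produce the predicted probability.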

DeepMTL: Converting sparse IndexedSlices to a dense Tensor of unknown shape

I ran into:
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape.
This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

This happened when running the DeepMTL model.
My GPU has 6 GB of memory.
Reported memory usage: Allocation of 16384131072 exceeds 10% of system memory
GPU memory is barely used.
Is this a problem with the code?

https://github.com/lambdaji/tf_repos/tree/master/DeepMTL/Feature_pipeline

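This directory contains a pile of feature-processing scripts that are quite disorganized and hard to follow. In what order should they be run? get_join_sample.sh produces libsvm output without feature-frequency filtering; to obtain the final frequency-filtered libsvm data, in what order should these scripts be executed?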

Small error in the code in PNN.py

outer = tf.reshape(tf.einsum('aik,ajk->aijk', p, q), [-1, num_pairs*embedding_size*embedding_size])
should be:
outer = tf.reshape(tf.einsum('api,apj->apij', p, q), [-1, num_pairs*embedding_size*embedding_size])
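
To make the index pattern concrete: with p and q of shape [batch, num_pairs, embedding_size], `'api,apj->apij'` takes one outer product per pair and yields num_pairs * embedding_size * embedding_size values per sample, whereas `'aik,ajk->aijk'` yields num_pairs * num_pairs * embedding_size. The loop-based sketch below is a pure-Python equivalent of the proposed contraction on toy values; the function name `pairwise_outer` is hypothetical.

```python
def pairwise_outer(p, q):
    """outer[a][n][i][j] = p[a][n][i] * q[a][n][j], i.e. 'api,apj->apij'."""
    return [[[[pi * qj for qj in q_vec] for pi in p_vec]
             for p_vec, q_vec in zip(p_row, q_row)]
            for p_row, q_row in zip(p, q)]

# batch=1, num_pairs=2, embedding_size=3 (toy values).
p = [[[1, 2, 3], [4, 5, 6]]]
q = [[[1, 0, 1], [2, 2, 2]]]
outer = pairwise_outer(p, q)

# Flatten one sample, as the reshape to [-1, num_pairs*K*K] would.
flat = [x for pair in outer[0] for row in pair for x in row]
```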

libsvm data format question

In get_criteo_feature.py the feature indices should start at 1, but PNN.py uses idx directly. In that case, doesn't the embedding done by tf.nn.embedding_lookup start from 0?
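
`tf.nn.embedding_lookup` indexes rows from 0, so with ids that start at 1 there are two usual options: shift the ids by one, or allocate feature_size + 1 rows and leave row 0 unused. The sketch below shows the shift on a plain Python list standing in for the embedding table; the helper `lookup` and the table values are hypothetical.

```python
def lookup(table, feat_id, one_based=True):
    """Fetch an embedding row; shift 1-based libsvm ids to 0-based rows."""
    row = feat_id - 1 if one_based else feat_id
    if not 0 <= row < len(table):
        raise IndexError("feature id %d is outside the table" % feat_id)
    return table[row]

# Toy table with feature_size = 3 rows and embedding_size = 2.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
first = lookup(table, 1)
last = lookup(table, 3)
```

Without the shift, the largest id would point one row past the end of a table that has exactly feature_size rows.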
