coder-yu / SELFRec
An open-source framework for self-supervised recommender systems.
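Hello, how should I understand the following line of code in the XSimGCL model?
ego_embeddings += torch.sign(ego_embeddings) * F.normalize(random_noise, dim=-1) * self.eps
This should be the data augmentation part of the paper. Does it correspond to sign(e) multiplied element-wise by X? If so, I don't understand X. If not, what does the F.normalize(random_noise, dim=-1) * self.eps part mean?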
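A minimal plain-Python sketch of what that line computes for a single embedding vector. It assumes random_noise is drawn uniformly from [0, 1), which the issue does not show. The names perturb and rng and the toy values are hypothetical, and this is not the SELFRec implementation. Under that assumption, the normalized noise is a unit-length direction, eps sets the length of the perturbation, and sign(e) makes every noise component point the same way as the corresponding embedding component.

```python
import math
import random

def perturb(embedding, eps, rng):
    """Add sign-aligned noise of L2 length eps to one embedding vector."""
    # Assumption: the noise is uniform in [0, 1), so it is non-negative.
    noise = [rng.random() for _ in embedding]
    norm = math.sqrt(sum(x * x for x in noise)) or 1.0
    unit_noise = [x / norm for x in noise]           # plays the role of F.normalize(random_noise, dim=-1)
    sign = [(x > 0) - (x < 0) for x in embedding]    # plays the role of torch.sign(ego_embeddings)
    return [e + s * n * eps for e, s, n in zip(embedding, sign, unit_noise)]

rng = random.Random(0)
e = [0.5, -0.2, 0.1, -0.4]
e_perturbed = perturb(e, eps=0.1, rng=rng)
delta = [p - q for p, q in zip(e_perturbed, e)]
delta_norm = math.sqrt(sum(d * d for d in delta))    # length of the added perturbation
```

In this toy run the perturbation has length eps and never flips the sign direction of a component, so the perturbed vector stays in the same orthant as the original.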
Thank you very much for your work.
The output of "top-20items.txt" is useful, so I want to output the results of LightGCN with this code.
However, the results were lower than expected. Could you please tell me the cause of this?
LightGCN - INFO - ### model configuration ###
LightGCN - INFO - training.set=./dataset/yelp2018/train.txt
LightGCN - INFO - test.set=./dataset/yelp2018/test.txt
LightGCN - INFO - model.name=LightGCN
LightGCN - INFO - model.type=graph
LightGCN - INFO - item.ranking=-topN 10,20
LightGCN - INFO - embbedding.size=64
LightGCN - INFO - num.max.epoch=500
LightGCN - INFO - batch_size=2048
LightGCN - INFO - learnRate=0.001
LightGCN - INFO - reg.lambda=0.0001
LightGCN - INFO - LightGCN=-n_layer 3
LightGCN - INFO - output.setup=-dir ./results/
LightGCN - INFO - ###Evaluation Results###
LightGCN - INFO - ['Top 10\n', 'Precision:0.03074712643678161\n', 'Recall:0.03410035637045833\n', 'F1:0.03233704450729225\n', 'NDCG:0.03980414986661795\n', 'Top 20\n', 'Precision:0.02679202980927119\n', 'Recall:0.05948183429247423\n', 'F1:0.03694372783847195\n', 'NDCG:0.049277584262262614\n']
The results after 1000 epochs were almost the same.
Is there anything I got wrong? Thanks.
Hello! Is the following the code that computes L-uniform in Eq. 9 of Sec. 3.2?
import torch

def uniform_loss(x, t=2):
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# user_e and item_e are the final embeddings obtained from the convolution
uniform = (uniform_loss(user_e) + uniform_loss(item_e)) / 2
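For readers without PyTorch at hand, the same quantity can be sketched in plain Python: the log of the mean Gaussian potential exp(-t * ||x_i - x_j||^2) over all pairs i < j, which is what torch.pdist followed by pow, mul, exp, mean and log evaluates. The helper l2_normalize and the toy point sets are assumptions for illustration. The uniformity metric is usually defined on L2-normalized features, and whether the embeddings in the question are normalized first is not stated.

```python
import math
from itertools import combinations

def l2_normalize(v):
    """Scale a vector to unit L2 length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def uniform_loss(vectors, t=2.0):
    """Log of the mean of exp(-t * squared distance) over all pairs i < j."""
    potentials = []
    for a, b in combinations(vectors, 2):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))

# Hypothetical toy inputs: four points spread over the unit circle vs. four nearly identical points.
spread = [l2_normalize(v) for v in [(1, 0), (0, 1), (-1, 0), (0, -1)]]
clustered = [l2_normalize(v) for v in [(1, 0), (1, 0.1), (1, -0.1), (1, 0.05)]]
loss_spread = uniform_loss(spread)
loss_clustered = uniform_loss(clustered)
```

A more uniform set of points gives a lower (more negative) value, so loss_spread is smaller than loss_clustered here.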
GPU utilization seems low when I use SELFRec. Which parameters should I set to improve the running speed?
Other models run on the same dataset without this problem, but NCL raises the following error. How can I fix it?
  File "main.py", line 37, in <module>
    rec.execute()
  File "/root/SELFRec/SELFRec.py", line 25, in execute
    eval(recommender).execute()
  File "/root/SELFRec/base/recommender.py", line 73, in execute
    self.train()
  File "/root/SELFRec/model/graph/NCL.py", line 121, in train
    self.fast_evaluation(epoch)
  File "/root/SELFRec/base/graph_recommender.py", line 91, in fast_evaluation
    rec_list = self.test()
  File "/root/SELFRec/base/graph_recommender.py", line 57, in test
    item_names = [self.data.id2item[iid] for iid in ids]
  File "/root/SELFRec/base/graph_recommender.py", line 57, in <listcomp>
    item_names = [self.data.id2item[iid] for iid in ids]
KeyError: 22225
I get the error "module 'tensorflow' has no attribute 'contrib'" when running the MHCN algorithm. Is there a version of the repo that supports newer versions of TensorFlow?
AttributeError Traceback (most recent call last)
Input In [2], in <cell line: 4>()
29 exit(-1)
30 rec = SELFRec(conf)
---> 31 rec.execute()
32 e = time.time()
33 print("Running time: %f s" % (e - s))
File C:\SELFRec-main/SELFRec-main\SELFRec.py:28, in SELFRec.execute(self)
26 exec(import_str)
27 recommender = self.config['model.name'] + '(self.config,self.training_data,self.test_data,**self.kwargs)'
---> 28 eval(recommender).execute()
File C:\SELFRec-main/SELFRec-main\base\recommender.py:71, in Recommender.execute(self)
69 self.print_model_info()
70 print('Initializing and building model...')
---> 71 self.build()
72 print('Training Model...')
73 self.train()
File C:\SELFRec-main/SELFRec-main\model\graph\MHCN.py:62, in MHCN.build(self)
60 self.weights = {}
61 self.n_channel = 4
---> 62 initializer = tf.contrib.layers.xavier_initializer()
63 self.user_embeddings = tf.Variable(initializer([self.data.user_num, self.emb_size]))
64 self.item_embeddings = tf.Variable(initializer([self.data.item_num, self.emb_size]))
AttributeError: module 'tensorflow' has no attribute 'contrib'
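I'd like to ask about the implementation of Eq. (9) in SimGCL. In the paper, it is the L2 norm of the difference between two features, but l2_reg_loss() in the loss simply sums the norms. My understanding is that the formula expresses a pair-wise, relative consistency between features, whereas the loss implemented in the code only encourages small norms. What was your reasoning here?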
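When reproducing the uniformity loss, I found that its trend is similar to the one shown in the paper, but the values differ. I'm not sure whether my implementation of the uniformity loss differs from yours or whether I misunderstood your sampling strategy. Would you be willing to release the code?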
Hello,
I was wondering where I can find the original dataset you are using here. I understand the accompanying dataset has already been preprocessed (by the way, exactly how it was preprocessed is unclear to me). Is there a resource for accessing the raw dataset?
For instance I was expecting the amazon-book dataset to be something similar to this where you can visualize the data as done in this example. It's unclear to me whether there is one standard dataset being used (as I see multiple papers in this survey referencing) or whether there are different variations.
Thanks!
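Dear author, I had the pleasure of reading your two outstanding works, SimGCL and XSimGCL. XSimGCL in particular is simple, elegant, and effective.
While looking at the InfoNCE function used by the two models, I noticed that it only considers the similarity between positive samples and ignores the similarity between negative samples. What is the purpose of this? (This may be a simple question; I would be very grateful for a reply.)
import torch
import torch.nn.functional as F

def InfoNCE(view1, view2, temperature: float, b_cos: bool = True):
    if b_cos:
        view1, view2 = F.normalize(view1, dim=1), F.normalize(view2, dim=1)
    pos_score = (view1 @ view2.T) / temperature
    score = torch.diag(F.log_softmax(pos_score, dim=1))
    return -score.mean()
The inconsistency with the loss function in the XSimGCL paper is another reason I am confused.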
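One way to check how negatives enter this function is to expand the row-wise log-softmax by hand. The plain-Python sketch below is not the repository code; info_nce and the toy batches are hypothetical. It computes, for each row i, -log(exp(s_ii) / sum_j exp(s_ij)), which is the diagonal of the log-softmax of the full similarity matrix. Every off-diagonal entry s_ij (an in-batch negative) therefore appears in the denominator, even though only the diagonal is read out at the end.

```python
import math

def info_nce(view1, view2, temperature):
    """Mean over i of -log(exp(s_ii) / sum_j exp(s_ij)), where s = view1 @ view2.T / temperature."""
    losses = []
    for i, u in enumerate(view1):
        scores = [sum(a * b for a, b in zip(u, v)) / temperature for v in view2]
        log_denominator = math.log(sum(math.exp(s) for s in scores))  # sums over the positive and all negatives
        losses.append(-(scores[i] - log_denominator))                 # diagonal entry of the row-wise log-softmax
    return sum(losses) / len(losses)

# Hypothetical unit-length toy views; the positive-pair similarity is 1 in both batches.
batch_far = [[1.0, 0.0], [0.0, 1.0]]    # negatives are orthogonal (similarity 0)
batch_near = [[1.0, 0.0], [0.6, 0.8]]   # negatives are similar (similarity 0.6)
loss_far = info_nce(batch_far, batch_far, temperature=0.2)
loss_near = info_nce(batch_near, batch_near, temperature=0.2)
```

Both batches have identical positive-pair similarity, yet loss_near is larger than loss_far, which shows that the similarity to negatives does affect the loss.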
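I see DuoRec among the options in ssl_sequential_models. Will an implementation of this model be added later? In that model's own issues, many people say they cannot reach the results reported in the paper, but the authors have not responded.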
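Hello! I have a question about how data in SELFRec handles the training samples:
Suppose the training data says that user 0 interacted with items 0, 1, and 2, and user 1 interacted with items 0 and 1, so the training set contains 5 interactions in total. To construct the training set, SELFRec uses (0,0,_), (0,1,_), (0,2,_), (1,0,_), (1,1,_).
The three elements of each triple are the user ID, the positive sample ID, and the negative sample ID; the underscore stands for a randomly drawn negative sample. This guarantees that the training set covers every interaction exactly once. Is my understanding correct?
However, in LightGCN and SGL, I found that they first sample as many users as there are training samples and then obtain the positive and negative samples. Which of the two approaches is more reasonable? If I use LightGCN as a baseline, do I need to define a sampling method like its own? And if I want to design my own recommendation algorithm, which sampling method would you recommend?
Thank you very much!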
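The first scheme described above (one triple per observed interaction, with a freshly drawn negative) can be sketched as follows. This is an illustration of the idea under stated assumptions, not SELFRec's actual sampler: pairwise_batches, its parameters, and the uniform negative sampling are hypothetical, and the sketch assumes every user has at least one unobserved item.

```python
import random

def pairwise_batches(interactions, num_items, batch_size, rng):
    """Yield (users, pos_items, neg_items); every interaction appears exactly once per epoch."""
    seen = {}
    for user, item in interactions:
        seen.setdefault(user, set()).add(item)
    order = list(interactions)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        users, pos_items, neg_items = [], [], []
        for user, pos in order[start:start + batch_size]:
            neg = rng.randrange(num_items)
            while neg in seen[user]:          # resample until the item is unobserved for this user
                neg = rng.randrange(num_items)
            users.append(user)
            pos_items.append(pos)
            neg_items.append(neg)
        yield users, pos_items, neg_items

# The five interactions from the example above, with a hypothetical catalogue of 5 items.
interactions = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
batches = list(pairwise_batches(interactions, num_items=5, batch_size=2, rng=random.Random(42)))
```

The user-first scheme would instead draw users at random and then pick one of their items, so active users are not guaranteed full coverage within an epoch; that difference is the trade-off the question asks about.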
Thank you very much for your great work.
I compared the accuracy of the two models (SASRec and CL4SRec in your code) on Amazon-beauty, and SASRec was superior.
Empirically, contrastive learning based on InfoNCE should work better, since it also shapes the representation space of the sequence embeddings.
Could you tell me what you think of this? Is it due to differences in datasets and a lack of hyperparameter tuning (or the all-ranking protocol)?
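After I have trained a model, how can I continue training it when new data arrives (the new data may contain items and users that did not appear in the previous dataset)? I see that model initialization builds a CSR matrix from the input data. Should I extend the original CSR matrix with the new data and then train, or should I train only on the new data and then add the embeddings for the new data to the old embeddings?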
When training sequential models such as CL4SRec, after a few epochs of training I get NaNs for batch_loss and rec_loss. For instance, see the output below:
## CL4SRec
Epoch: 2, Hit Ratio:0.02066 | Precision:0.00103 | Recall:0.02066 | NDCG:0.00784
*Best Performance*
Epoch: 2, Hit Ratio:0.02066 | NDCG:0.00784
------------------------------------------------------------------------------------------------------------------------
training: 3 batch 50 batch_loss: 0.5157323479652405 rec_loss: 0.4582507908344269
Evaluating the model...
Progress: [++++++++++++++++++++++++++++++++++++++++++++++++++]100%
------------------------------------------------------------------------------------------------------------------------
Real-Time Ranking Performance (Top-20 Item Recommendation)
*Current Performance*
Epoch: 3, Hit Ratio:0.03372 | Precision:0.00169 | Recall:0.03372 | NDCG:0.01302
*Best Performance*
Epoch: 3, Hit Ratio:0.03372 | NDCG:0.01302
------------------------------------------------------------------------------------------------------------------------
training: 4 batch 50 batch_loss: 0.4821103513240814 rec_loss: 0.4299513101577759
Evaluating the model...
Progress: [++++++++++++++++++++++++++++++++++++++++++++++++++]100%
------------------------------------------------------------------------------------------------------------------------
Real-Time Ranking Performance (Top-20 Item Recommendation)
*Current Performance*
Epoch: 4, Hit Ratio:0.04212 | Precision:0.00211 | Recall:0.04212 | NDCG:0.01622
*Best Performance*
Epoch: 4, Hit Ratio:0.04212 | NDCG:0.01622
------------------------------------------------------------------------------------------------------------------------
training: 5 batch 50 batch_loss: nan rec_loss: nan
Any ideas what could be causing this?
P.S. This is training with the amazon-beauty dataset; some of the other datasets don't load with this model.
Is the amazon-book dataset used in the SimGCL paper the content of /data/amazon-kindle in this repository?
Hello! Regarding this code for drawing the distribution plot,
import pickle
import numpy as np

ue = open('user', 'rb')
user = pickle.load(ue)
udx = np.random.choice(len(user), 2000)
embs = ['user_lgcn.emb', 'user_sgl.emb', 'user_simgcl.emb']
models = ['LightGCN','SGL-ED','SimGCL']
data = {}
I'd like to ask whether the user embedding and item embedding used for the SimGCL distribution plot are the optimized initial embeddings:
def _init_model(self):
    initializer = nn.init.xavier_uniform_
    embedding_dict = nn.ParameterDict({
        'user_emb': nn.Parameter(initializer(torch.empty(self.data.user_num, self.emb_size))),
        'item_emb': nn.Parameter(initializer(torch.empty(self.data.item_num, self.emb_size))),
    })
    return embedding_dict
or the user embedding and item embedding obtained after the model's forward pass?
Dear author, I'd like to ask how exactly the feature distribution plots in SimGCL are drawn. I would be very grateful for an answer.
Hey Yu. Your work is very impressive. Thanks for your open-source contribution. I have a question about using this project. As the figure below shows, a key point of the SSL implementation in this work is constructing negative samples by masking or dropout. I noticed that you construct negative samples from the in-batch items that the user has not interacted with, but the '_neg' values are not used as input to the loss calculation. Could you please explain this? Thanks a lot!
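In the experiments of your SimGCL paper, DNN+SSL should correspond to the SSL4Rec model, right? But when I run this model on the yelp dataset, I get the following result:
Best Performance
Epoch: 22, Hit Ratio:0.0183126791239777 | Precision:0.009372236958443855 | Recall:0.02152239714180159 | NDCG:0.01612651549405071
However, the results in your two papers, SimGCL and XSimGCL, are much better than this, with recall reaching 0.0483. I'm not sure whether I configured something incorrectly; I only changed the dataset path in the conf file to yelp.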
ego_embeddings = torch.sparse.mm(self.sparse_norm_adj, ego_embeddings)
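Hello, regarding sparse_norm_adj in the code above, on what basis is this sparse matrix generated? I ran some experiments and found that this matrix is very important.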
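In LightGCN-style models, a matrix like this is commonly the symmetrically normalized adjacency D^{-1/2} A D^{-1/2} of the user-item bipartite graph, where A = [[0, R], [R^T, 0]] and R marks observed interactions. Assuming SELFRec follows that convention (which should be checked against ui_graph.py), a small dense sketch looks like this; normalized_bipartite_adjacency and the toy interactions are hypothetical, and the real matrix is stored in sparse form.

```python
import math

def normalized_bipartite_adjacency(interactions, num_users, num_items):
    """Dense D^{-1/2} A D^{-1/2} for A = [[0, R], [R^T, 0]], with R[u][i] = 1 for each observed pair."""
    n = num_users + num_items
    adj = [[0.0] * n for _ in range(n)]
    for user, item in interactions:
        adj[user][num_users + item] = 1.0   # user -> item edge
        adj[num_users + item][user] = 1.0   # item -> user edge
    degree = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[inv_sqrt[r] * adj[r][c] * inv_sqrt[c] for c in range(n)] for r in range(n)]

# Toy graph: user 0 interacted with items 0 and 1, user 1 with item 0.
norm_adj = normalized_bipartite_adjacency([(0, 0), (0, 1), (1, 0)], num_users=2, num_items=2)
```

Each edge weight is 1 / sqrt(deg(u) * deg(i)), so multiplying the embeddings by this matrix averages neighbour embeddings while down-weighting high-degree nodes, which may explain why the results are sensitive to it.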
SELFRec/base/graph_recommender.py
Lines 107 to 111 in 3fc66eb
Regarding the code here: if no best result has been saved yet, shouldn't the append happen outside the loop?
  for m in measure[1:]:
      k, v = m.strip().split(':')
      performance[k] = float(v)
-     self.bestPerformance.append(performance)
+ self.bestPerformance.append(performance)
  self.save()
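Hello, thank you for the open-source framework and datasets. However, I noticed that the Amazon-Electronics dataset used in the XSimGCL paper is not available here. Could this dataset be released?
I look forward to your reply. Many thanks.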
I am impressed by your work XSimGCL on self-supervised learning for GNN-based recommender systems. I have read your paper and code, and I would like to reproduce your results on the Amazon-Electronics dataset. However, I could not find the pre-processed version of this dataset or the detailed pre-processing pipeline. Could you please kindly share the pre-process settings of the Amazon-Electronics dataset, such as the train/test split file and the k-core setting? I would greatly appreciate your help. Thank you for your time and attention.
Hello, I'm a beginner in RS. I noticed that running with the default settings does not seem to use a validation set. Is a validation set necessary?
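Hello, thank you for providing a concise and efficient framework. While training SimGCL, I found that as cl_rate increases, the model seems to need more epochs early in training before the recommendation performance on the validation set starts to improve. If an early-stopping strategy is applied to SimGCL, what threshold do you think would be safe?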
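Hello, I'm new to RS. I recently looked at the BUIR code, which uses the yelp dataset. My understanding is that the final prediction is whether a user will interact with an item, not the rating value. Is that right? And is it enough for the third column of the dataset txt file to be any number?
I ask because the variable training_set_u in ui_graph.py does record the rating, but the test() function in graph_recommender.py does not use the contents of the variable li.
Also, in the __create_sparse_bipartite_adjacency() function in ui_graph.py, is rating likewise unrelated to the ratings in the dataset? Since it is generated with np.ones_like(), does it only indicate whether two nodes are connected?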
Dear Yu,
Thank you very much for your outstanding work. I have been attempting to build a recommendation system framework from scratch recently, but I encountered some issues. I attempted to find answers within SELFRec, but I'm still a bit confused. Therefore, I would like to ask you about two questions, hoping to receive your guidance:
Firstly, I notice that you have already preprocessed large sparse datasets like Yelp. I would like to know your cleaning rules. Currently, I've only done some basic processing:
# Filter out users with fewer than 5 interactions
user_counts = data_df['user_id'].value_counts()
data_df = data_df[data_df['user_id'].isin(user_counts[user_counts >= 5].index)]
# Map the unique user and item IDs to sequential integers
user_id_map = {idx: i for i, idx in enumerate(data_df['user_id'].unique())}
item_id_map = {idx: i for i, idx in enumerate(data_df['item_id'].unique())}
data_df['user_id'] = data_df['user_id'].map(user_id_map)
data_df['item_id'] = data_df['item_id'].map(item_id_map)
# Filter out interactions with a rating less than 3
data_df = data_df.loc[data_df['rating'] >= 3].copy()  # work on a copy of the filtered DataFrame
data_df.drop(columns=['rating'], inplace=True)
Secondly, I want to know the strategies you use when conducting quick evaluation. When I use a full ranking for evaluation, the code runs relatively slowly.
Thank you for your valuable work again. I've been reading your article on self-supervised learning recently, and it's been absolutely fascinating.
If there is one, I'd love to see it updated soon~
Dear author, how do I run the code for SGL-ND, SGL-ED, SGL-RW, and SGL-WA in this project?
The SimGCL paper contains this passage: "We split the datasets into three parts (training set, validation set, and test set) with a ratio of 7:1:2."
However, the datasets here do not include a validation set, only a training set and a test set. Why is that?
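For anyone who wants to recreate a three-way split locally, here is a minimal sketch of a 7:1:2 cut. It assumes a global random shuffle over interactions, which the quoted sentence does not specify (the paper may split per user or by another rule), and split_interactions, its parameters, and the toy records are hypothetical.

```python
import random

def split_interactions(interactions, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle the interactions and cut them into train/validation/test parts by the given ratios."""
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_valid = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # the remainder goes to the test set
    return train, valid, test

records = [(u, i) for u in range(10) for i in range(10)]   # 100 hypothetical (user, item) pairs
train, valid, test = split_interactions(records)
```

With only train.txt and test.txt available, one option under the same assumption is to carve the validation part out of the training file rather than touching the test file.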
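Excellent job! SimGCL is the most inspiring work in GCL4Rec in recent years. The paper says that the graph augmentation in SGL is unnecessary, so SGL-WA, which performs no graph augmentation, should not perform badly either. But when I removed the noise module from SimGCL and ran contrastive learning, the performance was very poor:
Best Performance
Epoch: 2, Hit Ratio:0.00173 | Precision:0.00088 | Recall:0.00203 | NDCG:0.00165
What could be the reason for this?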
Dear authors,
Thanks a lot for sharing the code and datasets for your model. Would you kindly share the settings you used to obtain the results for SimGCL on Amazon Book? That would be useful for my research.
Thank you again.
Best regards,
Daniele
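Hello, does SELFRec support using multiple test sets during testing? For example, for the same dataset I have two test sets, A and B, and I would like both to be evaluated in the same run.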
Hello, I need your help. I'm trying to run the SelfCF code, but I have an issue: do I also need to run the SELFRec file, or only the SelfCF.py file?
Thanks in advance.
(ICLR'23) LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation
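Sorry to bother you.
The SimGCL work is very inspiring, but I have a question. The paper argues that the key is not dropout-based augmentation, but that a more uniform distribution achieves a debiasing effect.
However, the title "Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation" confuses me a little. Dropout-based augmentation does not work well on GBRSs, but are other, non-dropout augmentation methods necessary on GBRSs?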
Thank you for your excellent work. I noticed slightly different regularization loss in different models.
In LightGCN, the total loss is:
batch_loss = bpr_loss(user_emb, pos_item_emb, neg_item_emb) + l2_reg_loss(self.reg, user_emb,pos_item_emb,neg_item_emb)/self.batch_size
In SGL, the total loss is:
batch_loss = rec_loss + l2_reg_loss(self.reg, user_emb, pos_item_emb,neg_item_emb) + cl_loss
In SimGCL, the total loss is:
batch_loss = rec_loss + l2_reg_loss(self.reg, user_emb, pos_item_emb) + cl_loss
The l2_reg_loss() terms in these three losses are different. Is there something I missed? Looking forward to your reply.
Is it normal for testing to be much slower than training? While monitoring, I see that the GPU is barely used.
Hello, what are the best values of the temperature t in contrastive learning for the other datasets?
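Dear author, I'm a beginner in debiasing for RS. I recently read the SimGCL paper and would like to ask whether you would consider open-sourcing the plotting code for Fig. 2 in Section 2.3 and Fig. 4 in Section 3.2. I look forward to your reply.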
While debugging, I found that the recommendation list contains duplicate elements.
At line 87 of base/graph_recommender.py: rec_list = self.test().
I got a rec_list whose content is {'22': [('1190', 0.00026), ('325', 0.00023), ('325', 0.00023), ('166', 0.00017), ('166', 0.00017)],....}
'325' and '166' appear twice.
I checked your code and found the bug at line 152 of util/algorithm.py, which reads: for iid, score in enumerate(candidates)
candidates[:K] has already been assigned to n_candidates, but line 152 still traverses all candidates, so it fails to avoid repetition.
When I changed line 152 of util/algorithm.py to for iid, score in enumerate(candidates[K:]): and line 153 to iid = iid + K, the bug was fixed.
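The fix described above can be illustrated with a small heap-based top-K selection. This is a sketch written from the description in this report, not a copy of util/algorithm.py; the name top_k_items and the toy scores are hypothetical. The first K candidates seed the heap, and only the remaining candidates are scanned, with their index shifted by K so that no item can enter the heap twice.

```python
import heapq

def top_k_items(k, scores):
    """Return (item_ids, top_scores) for the k largest scores, each item id at most once."""
    heap = [(score, iid) for iid, score in enumerate(scores[:k])]
    heapq.heapify(heap)                                     # min-heap: heap[0] is the weakest kept candidate
    for offset, score in enumerate(scores[k:]):             # scan only the candidates not already in the heap
        if score > heap[0][0]:
            heapq.heapreplace(heap, (score, offset + k))    # offset + k restores the original index
    heap.sort(key=lambda pair: pair[0], reverse=True)
    return [iid for _, iid in heap], [score for score, _ in heap]

ids, top_scores = top_k_items(3, [0.9, 0.1, 0.8, 0.7, 0.95])
```

Scanning all candidates again instead of scores[k:] would let the first K items be pushed a second time, which matches the duplicated '325' and '166' entries reported above.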