
autocf's Introduction

Automated Self-Supervised Learning for Recommendation

This repository contains the PyTorch implementation by @akaxlh of the AutoCF model proposed in the following paper:

Lianghao Xia, Chao Huang, Chunzhen Huang, Kangyi Lin, Tao Yu, and Ben Kao. Automated Self-Supervised Learning for Recommendation. In WWW'23, Austin, US, April 30 - May 4, 2023. Paper available on arXiv.

Introduction

To improve representation quality over limited labeled data, contrastive learning has recently attracted attention in recommendation and benefited graph-based CF models. However, the success of most contrastive methods relies heavily on manually crafted, heuristic data augmentation to generate effective contrastive views. Such augmentation does not generalize across datasets and downstream recommendation tasks, is hard to adapt to different data, and is not robust to noise perturbation. As shown in the figure below, state-of-the-art SSL methods (e.g., NCL, SimGCL) suffer severe performance drops compared to our AutoCF under high-ratio noise and long-tail data distributions.

To fill this crucial gap, this work proposes a unified Automated Collaborative Filtering (AutoCF) framework that automatically performs data augmentation for recommendation. Specifically, we focus on a generative self-supervised learning framework with a learnable augmentation paradigm that enables automated distillation of important self-supervised signals. To enhance representation discrimination, our masked graph autoencoder aggregates global information during augmentation by reconstructing masked subgraph structures. The overall framework of AutoCF is shown below.
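As a rough illustration of the masked-autoencoding idea (a toy sketch with assumed shapes and names, not the authors' implementation), one can hide part of the interaction graph, aggregate over the visible edges only, and score the held-out edges with embedding dot products:

```python
import torch as t

t.manual_seed(0)

# Toy interaction graph: (user, item) edge list; sizes are illustrative.
n_users, n_items, dim = 4, 5, 8
edges = t.tensor([[0, 1], [1, 2], [2, 2], [3, 4], [0, 3]])
mask = t.tensor([True, False, False, True, False])  # edges held out for reconstruction
visible = edges[~mask]

user_emb = t.nn.Embedding(n_users, dim)
item_emb = t.nn.Embedding(n_items, dim)

# One light aggregation step over the *visible* subgraph only.
agg_u = user_emb.weight.index_add(0, visible[:, 0], item_emb.weight[visible[:, 1]])

# Reconstruction scores for the masked edges; training would push these up
# against scores of negative (non-edge) pairs.
masked = edges[mask]
scores = (agg_u[masked[:, 0]] * item_emb.weight[masked[:, 1]]).sum(-1)
print(scores.shape)  # torch.Size([2])
```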

Citation

@inproceedings{autocf2023,
  author    = {Xia, Lianghao and
               Huang, Chao and
               Huang, Chunzhen and
               Lin, Kangyi and
               Yu, Tao and
               Kao, Ben},
  title     = {Automated Self-Supervised Learning for Recommendation},
  booktitle = {The Web Conference (WWW)},
  year      = {2023},
}

Environment

The implementation for AutoCF is under the following development environment:

  • python=3.10.4
  • torch=1.11.0
  • numpy=1.22.3
  • scipy=1.7.3
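For convenience, the package pins above (the Python version aside) can be captured in a requirements file; this is a reproducibility sketch, not a file shipped with the repository:

```
torch==1.11.0
numpy==1.22.3
scipy==1.7.3
```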

Datasets

We utilize three datasets for evaluating AutoCF: Yelp, Gowalla, and Amazon. Note that, compared to the data used in our previous works, we adopt sparser versions of the three datasets here to increase the difficulty of the recommendation task. Our evaluation follows the common implicit-feedback paradigm. Each dataset is split into training, validation, and test sets with a 70:5:25 ratio.

| Dataset | # Users | # Items | # Interactions | Interaction Density |
|---------|---------|---------|----------------|---------------------|
| Yelp    | $42,712$ | $26,822$ | $182,357$ | $1.6\times 10^{-4}$ |
| Gowalla | $25,557$ | $19,747$ | $294,983$ | $5.9\times 10^{-4}$ |
| Amazon  | $76,469$ | $83,761$ | $966,680$ | $1.5\times 10^{-4}$ |
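The Interaction Density column is simply #Interactions / (#Users × #Items); a quick sanity check for the Yelp row:

```python
# Density = interactions / (users * items); numbers taken from the table above.
users, items, interactions = 42712, 26822, 182357
density = interactions / (users * items)
print(f"{density:.1e}")  # prints 1.6e-04
```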

Usage

Please unzip the datasets first. You also need to create the History/ and Models/ directories, then switch the working directory to methods/AutoCF/. The commands to train AutoCF on the three datasets are listed below; hyperparameters not specified in the commands are left at their defaults.

  • Yelp
python Main.py --data yelp --reg 1e-4 --seed 500
  • Gowalla
python Main.py --data gowalla --reg 1e-6
  • Amazon
python Main.py --data amazon --reg 1e-5 --seed 500

Important Arguments

  • reg: The weight for weight-decay regularization. We tune this hyperparameter over the set {1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8}.
  • seedNum: The number of seed nodes used in subgraph masking. Recommended values are 200 and 500.

autocf's People

Contributors

akaxlh, hkuds


autocf's Issues

Code question

Sorry to bother you, but I cannot understand the following code from the decoder-graph generation process. Could you explain it? Thanks.

```python
temNum = maskNodes.shape[0]
# sample random endpoints among the masked nodes
temRows = maskNodes[t.randint(temNum, size=[adj._values().shape[0]]).cuda()]
temCols = maskNodes[t.randint(temNum, size=[adj._values().shape[0]]).cuda()]

# symmetric sampled edges + self-loops + the original edges
newRows = t.concat([temRows, temCols, t.arange(args.user + args.item).cuda(), rows])
newCols = t.concat([temCols, temRows, t.arange(args.user + args.item).cuda(), cols])

# filter duplicated edges by hashing each (row, col) pair to a scalar
hashVal = newRows * (args.user + args.item) + newCols
hashVal = t.unique(hashVal)
newCols = hashVal % (args.user + args.item)
newRows = ((hashVal - newCols) / (args.user + args.item)).long()
```
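The deduplication step in the code above can be illustrated in isolation: each (row, col) pair is encoded as the single integer row * N + col, duplicates are dropped with torch.unique, and the pairs are decoded back. A standalone toy (N is assumed small here; in the repo it is args.user + args.item):

```python
import torch as t

N = 5  # total number of nodes (stands in for args.user + args.item)
rows = t.tensor([0, 1, 0, 2])
cols = t.tensor([3, 4, 3, 2])  # the pair (0, 3) appears twice

hashVal = rows * N + cols      # unique scalar id per (row, col) pair
hashVal = t.unique(hashVal)    # drops the duplicate edge (and sorts)
newCols = hashVal % N
newRows = t.div(hashVal - newCols, N, rounding_mode='floor')
print(newRows.tolist(), newCols.tolist())  # [0, 1, 2] [3, 4, 2]
```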

Questions about details

Hello, this work is fairly complex but brings a large performance gain. I have some questions about the details:

  1. During decoding, why is it reasonable to randomly sample pairs from the global set of users and items, connect them with edges, and then minimize their similarity? I do not understand why this random edge construction and optimization of their affinity makes sense. Also, in Eq. (9), is the \varepsilon missing a \bar?
  2. Why do the encoder and decoder need different structures? Is there any drawback to also using self-attention weighting in the encoder?

Question about the baseline setup

I changed the original line usrEmbeds, itmEmbeds = self.model(encoderAdj, decoderAdj) to usrEmbeds, itmEmbeds = self.model(self.handler.torchBiAdj) to simulate the LightGCN model, leaving everything else unchanged. The final Recall@20 reached 0.079, well above the 0.0761 reported in the paper.

Code error

Hello, I have a question: the line torch._C._cuda_init() raises an error.
Error message:
RuntimeError: No CUDA GPUs are available

About the source code

Why can I not find the part of the source code that computes the Lrecon loss?

Question about experiments

Thank you for the open-sourced code and excellent work. When running comparison experiments with this code, I found that training consumes a lot of GPU memory. Does the author have any suggestions for alleviating this?

Question about original paper

Thank you very much for your wonderful work and open-source code. However, I cannot download the original paper file from any platform. Could you provide a download link, or upload the paper to the code repository? Thank you.

Experimental performance

Why is no random seed set in this model? Without a fixed random seed, how can one reproduce the experimental numbers reported in the paper?

About the local subgraph construction process

Hello. I found that lines 81-97 of methods/Model.py in the released code correspond to Section 3.1.2 of the paper. The paper describes computing a score s that measures a center node's k-order neighbors, with k = 2 in both the paper and the code. From the code, the author obtains the sum of the initial embeddings of a center node's 2-hop neighbors as follows:

[image: code excerpt, not recoverable from the original post]

Based on this process I did some simple derivations, shown below:

[image: hand derivation, not recoverable from the original post]

On the right, E0, E1, and E2 are embedding tables; E1 is computed from E0, and E2 from E1 and E0. Take the first row of E1 as an example: since user A1 has interacted with items L1 and L2, the embedding sum of A1's 1-hop neighbors is 0.5 + 0.6 = 1.1. When computing the 2-hop interaction, aggregation happens once more, and the first row of E2 becomes 1.1 + 0.3 + 0.6 - 1.1 = 0.9; then, following the code's order * embeds, E0 is subtracted, i.e., 0.9 - 0.3 = 0.6. In reality, however, A1's 2-hop neighbors are A2 and A3, whose embedding sum should be 0.2 + 0.3 = 0.5, which does not match the computed result.
I am confused about this and look forward to your reply.
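The discrepancy described above comes down to a general property of repeated adjacency multiplication: A @ A counts length-2 walks, which can revisit the start node, rather than enumerating distinct 2-hop neighbors. A minimal check on a toy graph (unrelated to the repo's data):

```python
import torch as t

# Path graph 0 - 1 - 2 with a symmetric adjacency matrix.
A = t.tensor([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A2 = A @ A
# A2[0, 0] == 1: a length-2 walk returns to node 0 via node 1, so
# (A @ A) @ E0 mixes the start node's own embedding back in instead of
# summing only over distinct 2-hop neighbors.
print(A2)
```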

Baseline reproduction

How exactly was the AutoR baseline method reproduced?

import setproctitle

“ModuleNotFoundError: No module named 'setproctitle'”

Hello, how can I install the setproctitle package?

Long-tail experiment

Hello, your research is excellent. May I ask whether the code for the long-tail experiment shown in Figure 1 of the paper could be provided? Thanks.

Experiment question

Hello, how were the data-sparsity and noise experiments in the paper conducted? I hope you can help me with this, thanks!

Question about the datasets

Hello, thank you for your excellent paper and open-sourced code, but I have a question about the datasets. Does the interaction count in the paper's dataset table refer to the interactions in the whole dataset, or only in the training split after partitioning? I noticed that the nnz of the Amazon trnmat in the source code is already 966,680. I would appreciate a clarification.

Question about embedding dimensionality

Hello, and thank you for your excellent work and for open-sourcing it. Figure 11 in the appendix reports performance under different embedding sizes on Yelp and Gowalla, showing that AutoCF gains no further improvement beyond an embedding size of 30, while Section 4.1.3 sets the default embedding size to 32 for all models. Is this fair to the other baselines (e.g., LightGCN, SGL)? The widely used setting is 64, and many works have shown that these baselines perform better with larger embeddings (e.g., 128 and 256).
Also, did the authors consider comparing against SimGCL/XSimGCL? Prior studies have shown SimGCL to be a simple and effective graph contrastive recommendation method that outperforms SGL, NCL, and other methods.
Thanks for reading; I look forward to your reply.

Original datasets

Hello, the datasets in the released source code are sparse matrices in pkl format. Is the original (raw) version of the datasets available?
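Regarding the pkl format mentioned above: given the repo's SciPy dependency, the files are presumably pickled scipy sparse matrices, which round-trip through the standard pickle module. A toy in-memory example (the actual file names and matrix layout in the repo may differ):

```python
import io
import pickle
from scipy.sparse import coo_matrix

# Build a tiny sparse interaction matrix and round-trip it through pickle,
# mimicking how pickled sparse-matrix dataset files can be loaded.
mat = coo_matrix(([1, 1], ([0, 1], [2, 0])), shape=(3, 3))
buf = io.BytesIO()
pickle.dump(mat, buf)
buf.seek(0)
loaded = pickle.load(buf)
print(loaded.nnz)  # prints 2
```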
