
Comments (4)

lpworld commented on July 28, 2024

Hi Kangqi,

Thank you so much for your interest in our paper, and for your careful reading of our code! Your suggestions will definitely make our project better. We would like to clarify the issues you raised as follows:

(1) Regarding the training/test split. You are right that what we currently do is not strictly "chronological", because we order the user behaviors in the training set and the test set separately rather than jointly. The reason is that we want to test the performance of our proposed method on different combinations of user behavioral sequences, i.e., we randomly shuffle the dataset across multiple runs for cross-validation purposes (which is done externally). This gives us a more robust, though, as you point out, slightly "incorrect", evaluation of recommendation performance. We have followed your advice and conducted a time-based split, and we confirm that our performance is still significantly better than the baselines.
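For reference, here is a minimal sketch of one way to do such a per-user time-based split, assuming a pandas DataFrame with hypothetical `user_id`, `item_id`, and `timestamp` columns (not the exact code we used):

```python
import pandas as pd

def chronological_split(df, test_ratio=0.2):
    """Per-user time-based split: each user's earliest interactions go
    to the training set, and the most recent ones to the test set."""
    df = df.sort_values(["user_id", "timestamp"])
    train_parts, test_parts = [], []
    for _, user_df in df.groupby("user_id", sort=False):
        cutoff = int(len(user_df) * (1 - test_ratio))
        train_parts.append(user_df.iloc[:cutoff])  # earlier interactions
        test_parts.append(user_df.iloc[cutoff:])   # later interactions
    return pd.concat(train_parts), pd.concat(test_parts)
```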

(2) Regarding the calculation of the HR@10 and unexpectedness measures. Note that our dataset is very sparse: very few users in the test set have more than 10 interaction records. Therefore, computing HR@10 is equivalent to computing HR@ALL, i.e., we calculate the utility values of all possible <history_behavior, target_item> pairs in the test set and determine whether each pair is suitable for recommendation. The same logic applies to the unexpectedness measure. You are absolutely right that, ideally, we would calculate unexpectedness only for the products that are actually recommended. However, due to the sparsity of the dataset, recommending the top-10 items to each user would effectively traverse every record in the test set. That is why we chose to fix the <history_behavior, target_item> pairs for evaluation. Again, as you mention, this is not an ideal setting and could introduce biases into the evaluation process.
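To make the HR@10 = HR@ALL equivalence concrete, here is a minimal sketch of this fixed-pair evaluation, assuming a hypothetical `score(history, item)` utility function and test pairs grouped per user (the names are illustrative, not from the repository):

```python
def hr_at_k(user_pairs, score, k=10):
    """HR@K over fixed <history_behavior, target_item> pairs.

    user_pairs: dict mapping a user's history (hashable) to a list of
                (item, label) pairs with 0/1 click labels.
    score:      hypothetical utility function, score(history, item) -> float.
    """
    hits, positives = 0, 0
    for history, pairs in user_pairs.items():
        ranked = sorted(pairs, key=lambda p: score(history, p[0]), reverse=True)
        top_k = {item for item, _ in ranked[:k]}
        for item, label in pairs:
            if label == 1:
                positives += 1
                hits += item in top_k
    # When a user has fewer than k test items, top_k contains every
    # candidate pair, so HR@10 degenerates to HR@ALL.
    return hits / max(positives, 1)
```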

Sorry for the confusion, and thanks again for your valuable advice. I hope this addresses your concerns. Of course, you are more than welcome to contact me with any further comments. Have a nice day!

Best Regards,
Pan


lkq1992yeah commented on July 28, 2024

Hi Pan,

Thanks for your quick, patient and detailed explanation!

I now fully understand your intuition behind the data splitting and the evaluation metrics. For HR@10 and unexpectedness, I have a new idea: for HR@K and unexpectedness (yes, excluding AUC), even though the dataset is relatively sparse, there is no need to limit the evaluation data to the <u, i> pairs with explicit 0/1 labels. Given the current user and their behavior, we can retrieve the top-K scored items from the item pool as the final recommendation list, and then both HR@K and unexpectedness can be defined more straightforwardly:

HR@K = N_{click_item_in_top_k_results} / N_positive_data.

This measure is widely used in item retrieval (matching stage) papers, for example, MIND and ComiRec.

unexpectedness = SUM_{k, <u,i>} distance(user_behavior, recommended_item_k) / (K * N_data)

Since the novelty of a recommendation result is completely unrelated to the click label, all testing <u, i> pairs can be leveraged.

Obviously, such an evaluation method is much more computationally intensive, as we need to traverse every item in the item pool. Nevertheless, I believe these metrics would effectively strengthen your demonstration of the PURS algorithm, considering that the measurement of unexpectedness plays a truly central role in this paper.
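As a concrete illustration, here is a minimal sketch of both metrics under this retrieval-style protocol. The `score(history, item)` utility function and `embed(item)` embedding lookup are hypothetical placeholders for whatever the trained model provides:

```python
import numpy as np

def retrieval_metrics(test_users, item_pool, score, embed, k=10):
    """Retrieval-style HR@K and unexpectedness over the full item pool.

    test_users: list of (history, clicked_items) per test user.
    item_pool:  all candidate item ids (the entire pool is traversed).
    score:      hypothetical utility function, score(history, item) -> float.
    embed:      hypothetical embedding lookup, embed(item) -> np.ndarray.
    """
    hits, positives, dist_sum, dist_count = 0, 0, 0.0, 0
    for history, clicked in test_users:
        # Rank the entire item pool and keep the top-k as the final list.
        top_k = sorted(item_pool, key=lambda i: score(history, i),
                       reverse=True)[:k]
        positives += len(clicked)
        hits += len(set(top_k) & set(clicked))
        # Unexpectedness: mean distance between each recommended item
        # and the user's behavior sequence, independent of click labels.
        hist_vecs = np.stack([embed(i) for i in history])
        for item in top_k:
            dist_sum += np.linalg.norm(hist_vecs - embed(item), axis=1).mean()
            dist_count += 1
    hr_at_k = hits / max(positives, 1)          # N_clicked_in_top_k / N_positive
    unexpectedness = dist_sum / max(dist_count, 1)  # SUM distance / (K * N_users)
    return hr_at_k, unexpectedness
```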

And one more question: is there a way to evaluate the unexpectedness of different model variants under the same item embedding? Since the item embedding parameters are updated during training, it seems we cannot avoid the bias introduced by model differences. I am curious whether you have thought about this issue and whether you have any good ideas.

Best,
Kangqi


lpworld commented on July 28, 2024

Hi Kangqi,

Thank you again for the comments.

Regarding your first point: yes, I think that would be an excellent idea, and I will definitely try it in our implementation. Thank you so much for the suggestion!

Regarding your second point: it is indeed possible to do so (for example, we could train the item embeddings and the recommendation network separately). However, that would require a fundamental redesign of the currently proposed model, so I will explore it in future work.
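As a rough illustration of the "train the embeddings separately" idea, here is a minimal TF2/Keras sketch. The repository itself uses TF1-style code, so this is only the shape of the approach, with illustrative names and sizes: pretrain one item embedding table, freeze it, and reuse it in every model variant so that unexpectedness is always measured in the same embedding space.

```python
import tensorflow as tf

NUM_ITEMS, EMB_DIM = 10000, 64  # illustrative sizes

# One shared item embedding table, pretrained once and then frozen.
shared_embedding = tf.keras.layers.Embedding(
    NUM_ITEMS, EMB_DIM, name="shared_item_embedding")
# ... pretrain shared_embedding here (e.g., with a matrix-factorization
# objective), then freeze it before building the variants:
shared_embedding.trainable = False

def build_variant(hidden_units):
    """Each variant trains its own network on top of the frozen table."""
    item_ids = tf.keras.Input(shape=(1,), dtype=tf.int32)
    x = tf.keras.layers.Flatten()(shared_embedding(item_ids))
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(item_ids, out)

variant_a = build_variant(32)  # both variants see identical,
variant_b = build_variant(64)  # non-trainable item embeddings
```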

Best Regards,
Pan


lkq1992yeah commented on July 28, 2024

Thank you for your patient response!

