
OTTO Recommender Systems Dataset


A real-world e-commerce dataset for session-based recommender systems research.


Get the Data • Data Format • Installation • Evaluation • FAQ • License

The OTTO session dataset is a large-scale dataset intended for multi-objective recommendation research. We collected the data from anonymized behavior logs of the OTTO webshop and app. The mission of this dataset is to serve as a benchmark for session-based recommendations and to foster research on multi-objective and session-based recommender systems. We also launched a Kaggle competition with the goal of predicting clicks, cart additions, and orders based on previous events in a user session.

Key Features

  • 12M real-world anonymized user sessions
  • 220M events, consisting of clicks, carts, and orders
  • 1.8M unique articles in the catalogue
  • Ready-to-use data in .jsonl format
  • Evaluation metrics for multi-objective optimization

Dataset Statistics

| Dataset | #sessions  | #items    | #events     | #clicks     | #carts     | #orders   | Density [%] |
|---------|------------|-----------|-------------|-------------|------------|-----------|-------------|
| Train   | 12,899,779 | 1,855,603 | 216,716,096 | 194,720,954 | 16,896,191 | 5,098,951 | 0.0005      |
| Test    | 1,671,803  | 1,019,357 | 13,851,293  | 12,340,303  | 1,155,698  | 355,292   | 0.0005      |
|                            | mean  | std   | min | 50% | 75% | 90% | 95% | max |
|----------------------------|-------|-------|-----|-----|-----|-----|-----|-----|
| Train #events per session  | 16.80 | 33.58 | 2   | 6   | 15  | 39  | 68  | 500 |
| Test #events per session   | 8.29  | 13.74 | 2   | 4   | 8   | 18  | 28  | 498 |

[Figure: #events per session histogram (90th percentile)]

|                         | mean   | std    | min | 50% | 75% | 90% | 95% | max     |
|-------------------------|--------|--------|-----|-----|-----|-----|-----|---------|
| Train #events per item  | 116.79 | 728.85 | 3   | 20  | 56  | 183 | 398 | 129,004 |
| Test #events per item   | 13.59  | 70.48  | 1   | 3   | 9   | 24  | 46  | 17,068  |

[Figure: #events per item histogram (90th percentile)]

Get the Data

The data is stored on the Kaggle platform (https://www.kaggle.com/datasets/otto/recsys-dataset) and can be downloaded using their API:

kaggle datasets download -d otto/recsys-dataset
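The Kaggle CLI saves the dataset as a single zip archive; assuming the default archive name, it can then be unpacked with:

unzip recsys-dataset.zip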

Data Format

The sessions are stored as JSON objects containing a unique session ID and a list of events:

{
    "session": 42,
    "events": [
        { "aid": 0, "ts": 1661200010000, "type": "clicks" },
        { "aid": 1, "ts": 1661200020000, "type": "clicks" },
        { "aid": 2, "ts": 1661200030000, "type": "clicks" },
        { "aid": 2, "ts": 1661200040000, "type": "carts"  },
        { "aid": 3, "ts": 1661200050000, "type": "clicks" },
        { "aid": 3, "ts": 1661200060000, "type": "carts"  },
        { "aid": 4, "ts": 1661200070000, "type": "clicks" },
        { "aid": 2, "ts": 1661200080000, "type": "orders" },
        { "aid": 3, "ts": 1661200080000, "type": "orders" }
    ]
}
  • session - the unique session id
  • events - the time ordered sequence of events in the session
    • aid - the article id (product code) of the associated event
    • ts - the Unix timestamp of the event
    • type - the event type, i.e., whether a product was clicked, added to the user's cart, or ordered during the session
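Since each line of the .jsonl files holds exactly one such session object, the data can be streamed line by line instead of being loaded at once. A minimal Python sketch (the file name train.jsonl is taken from the scripts further below; the event-type count is just an illustration):

import json
from collections import Counter

def iter_sessions(path):
    """Yield one session dict per line of a .jsonl file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Illustration: count event types over the first 1,000 sessions
type_counts = Counter()
for i, session in enumerate(iter_sessions("train.jsonl")):
    if i >= 1000:
        break
    type_counts.update(event["type"] for event in session["events"])
print(type_counts)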

Submission Format

For each session id and type combination in the test set, you must predict the aid values in the labels column, which is space delimited. You can predict up to 20 aid values per row. The file should contain a header and have the following format:

session_type,labels
42_clicks,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
42_carts,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
42_orders,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
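A hedged sketch of writing a file in this format with Python's csv module (the predictions dict and its contents are hypothetical):

import csv

# Hypothetical predictions: {(session_id, event_type): ranked list of aids}
predictions = {
    (42, "clicks"): [0, 1, 2, 3],
    (42, "carts"): [2, 3],
    (42, "orders"): [2, 3],
}

with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["session_type", "labels"])
    for (session, event_type), aids in predictions.items():
        # at most 20 space-delimited aid values per row
        writer.writerow([f"{session}_{event_type}", " ".join(str(a) for a in aids[:20])])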

Installation

To run our scripts, you need to have Python 3 and Pipenv installed. Then, you can install the dependencies with:

pipenv sync

Evaluation

Submissions are evaluated on Recall@20 for each action type, and the three recall values are weight-averaged:

$$ score = 0.10 \cdot R_{clicks} + 0.30 \cdot R_{carts} + 0.60 \cdot R_{orders} $$

where $R_{type}$ is defined as

$$ R_{type} = \frac{ \sum\limits_{i=1}^N | \{ \text{predicted aids} \}_{i, type} \cap \{ \text{ground truth aids} \}_{i, type} | }{ \sum\limits_{i=1}^N \min{( 20, | \{ \text{ground truth aids} \}_{i, type} | )}} $$

and $N$ is the total number of sessions in the test set, and $\text{predicted aids}$ are the predictions for each session-type (e.g., each row in the submission file) truncated after the first 20 predictions.
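As a plain-Python sketch of this metric (labels and predictions are hypothetical dicts mapping session ids to aid lists, one dict per event type):

WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}

def recall_at_20(labels, predictions):
    """labels, predictions: {session_id: [aid, ...]} for one event type."""
    hits, denom = 0, 0
    for session, truth in labels.items():
        predicted = predictions.get(session, [])[:20]  # truncated after 20
        hits += len(set(predicted) & set(truth))
        denom += min(20, len(set(truth)))
    return hits / denom if denom else 0.0

def weighted_score(labels_by_type, predictions_by_type):
    # score = 0.10 * R_clicks + 0.30 * R_carts + 0.60 * R_orders
    return sum(
        weight * recall_at_20(labels_by_type[t], predictions_by_type.get(t, {}))
        for t, weight in WEIGHTS.items()
    )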

For each session in the test data, your task is to predict the aid values for each type that occur after the last timestamp ts in the test session. In other words, the test data contains sessions truncated by timestamp, and you are to predict what occurs after the point of truncation.

For clicks there is only a single ground truth value for each session, which is the next aid clicked during the session (although you can still predict up to 20 aid values). The ground truth for carts and orders contains all aid values that were added to a cart or ordered, respectively, during the session.

The labeled version of the session from above as JSON:
[
    {
        "aid": 0,
        "ts": 1661200010000,
        "type": "clicks",
        "labels": {
            "clicks": 1,
            "carts": [2, 3],
            "orders": [2, 3]
        }
    },
    {
        "aid": 1,
        "ts": 1661200020000,
        "type": "clicks",
        "labels": {
            "clicks": 2,
            "carts": [2, 3],
            "orders": [2, 3]
        }
    },
    {
        "aid": 2,
        "ts": 1661200030000,
        "type": "clicks",
        "labels": {
            "clicks": 3,
            "carts": [2, 3],
            "orders": [2, 3]
        }
    },
    {
        "aid": 2,
        "ts": 1661200040000,
        "type": "carts",
        "labels": {
            "clicks": 3,
            "carts": [3],
            "orders": [2, 3]
        }
    },
    {
        "aid": 3,
        "ts": 1661200050000,
        "type": "clicks",
        "labels": {
            "clicks": 4,
            "carts": [3],
            "orders": [2, 3]
        }
    },
    {
        "aid": 3,
        "ts": 1661200060000,
        "type": "carts",
        "labels": {
            "clicks": 4,
            "orders": [2, 3]
        }
    },
    {
        "aid": 4,
        "ts": 1661200070000,
        "type": "clicks",
        "labels": {
            "orders": [2, 3]
        }
    },
    {
        "aid": 2,
        "ts": 1661200080000,
        "type": "orders",
        "labels": {
            "orders": [3]
        }
    }
]

To create these labels from unlabeled sessions, you can use the ground_truth function in labels.py.
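For illustration, the labeling rule visible in the example above can be sketched as follows; this is not necessarily how labels.py implements it:

def label_events(events):
    """Attach future labels to each event of a time-ordered session.

    For each event, the clicks label is the aid of the next click, and the
    carts/orders labels are all aids carted/ordered later in the session.
    Events with no future labels left are dropped, as in the example above.
    """
    labeled = []
    for i, event in enumerate(events):
        future = events[i + 1:]
        labels = {}
        next_clicks = [e["aid"] for e in future if e["type"] == "clicks"]
        if next_clicks:
            labels["clicks"] = next_clicks[0]
        for kind in ("carts", "orders"):
            aids = [e["aid"] for e in future if e["type"] == kind]
            if aids:
                labels[kind] = aids
        if labels:
            labeled.append({**event, "labels": labels})
    return labeled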

Train/Test Split

Since we want to evaluate a model's performance in the future, as would be the case when we deploy such a system in an actual webshop, we chose a time-based validation split. Our train set consists of observations from 4 weeks, while the test set contains user sessions from the following week. Furthermore, we trimmed train sessions overlapping with the test period, as depicted in the following diagram, to prevent information leakage from the future:

[Figure: time-based train/test split with trimmed overlapping train sessions]

We will publish the final test set after the Kaggle competition has concluded. Until then, participants of the competition can create their own truncated test sets from the training sessions and use these to evaluate their models offline. For this purpose, we include a Python script called testset.py:

pipenv run python -m src.testset --train-set train.jsonl --days 2 --output-path 'out/' --seed 42 
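Judging by the evaluation example below, this writes the truncated test sessions and their labels (e.g., out/test_labels.jsonl) to the given --output-path.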

Metrics Calculation

You can use the evaluate.py script to calculate the Recall@20 for each action type and the weighted average Recall@20 for your submission:

pipenv run python -m src.evaluate --test-labels test_labels.jsonl --predictions predictions.csv

FAQ

How is a user session defined?

  • A session is all activity by a single user either in the train or the test set.

Are there identical users in the train and test data?

  • No, train and test users are completely disjoint.

Are all test aids included in the train set?

  • Yes, all test items are also included in the train set.

How can a session start with an order or a cart?

  • This can happen if the ordered item was already in the customer's cart before the data extraction period started. Similarly, a wishlist in our shop can lead to cart additions without a previous click.

Are aids the same as article numbers on otto.de?

  • No, all article and session IDs are anonymized.

Are most of the clicks generated by our current recommendations?

  • No, our current recommendations generated only about 20% of the product page views in the dataset. Most users reached product pages via search results and product lists.

Are you allowed to train on the truncated test sessions?

  • Yes, for the scope of the competition, you may use all the data we provided.

How is Recall@20 calculated if the ground truth contains more than 20 labels?

  • The denominator is capped at min(20, number of ground truth labels), so if you predict 20 items correctly, you will still score 1.0 even when the ground truth contains more than 20 labels.

Where can I find item and user metadata?

  • This dataset intentionally only contains anonymized IDs. Given its already large size, we deliberately did not include content features to make the dataset more manageable and focus on collaborative filtering techniques that solve the multi-objective problem.

License

The OTTO dataset is released under the CC-BY 4.0 License, while the code is licensed under the MIT License.

Citation

BibTeX entry:

@online{normann2022ottodataset,
  author       = {Philipp Normann and Sophie Baumeister and Timo Wilm},
  title        = {OTTO Recommender Systems Dataset: A real-world e-commerce dataset for session-based recommender systems research},
  date         = {2022-11-01},
}


recsys-dataset's Issues

How is density calculated?

Thank you so much for making this wonderful dataset and the README.md itself is a beautiful resource to read.

I just wonder how the density in the statistics section is calculated? What kind of info or implications does it provide?


Thanks again!

scores all zeros despite getting >0.3 on kaggle submission

Hi,

It seems like I'm getting all zero scores when running pipenv run python -m src.evaluate --test-labels test_labels.jsonl --predictions predictions.csv, where test_labels.jsonl refers to the file generated by pipenv run python -m src.testset --train-set train.jsonl --days 2 --output-path 'out/' --seed 42. However, submitting the same CSV to Kaggle gives a score >0.3. Any suggestions on what might be causing this?

INFO:root:Reading labels from out/test_labels.jsonl
Preparing labels: 100%|█████████████| 515702/515702 [00:05<00:00, 101878.16it/s]
INFO:root:Read 515702 labels
INFO:root:Reading predictions from predictions.csv
Preparing predictions: 100%|███████| 5015409/5015409 [06:25<00:00, 13000.44it/s]
INFO:root:Read 1671803 predictions
INFO:root:Calculating scores
Evaluating sessions: 100%|██████████| 515702/515702 [00:01<00:00, 324037.41it/s]
INFO:root:Scores: {'clicks': 0.0, 'carts': 0.0, 'orders': 0.0, 'total': 0.0}

Thank you!

Question about items

Hi,

It seems that the only thing we know about an item is its aid. Is the aid totally random, or does it carry some information about the function of the item?

beartype error

Hi,

I get the following error complaining about a beartype import when running !pipenv run python3 -m src.testset --train-set train.jsonl --days 2 --output-path 'out/' --seed 42

Any suggestion on how to fix this?

Thanks!

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/yc/Desktop/kaggle/sample/recsys-dataset/src/testset.py", line 8, in <module>
    from beartype import beartype
  File "/Users/yc/.local/share/virtualenvs/recsys-dataset-XXnHBDKs/lib/python3.9/site-packages/beartype/__init__.py", line 57, in <module>
    from beartype._decor.decormain import beartype
  File "/Users/yc/.local/share/virtualenvs/recsys-dataset-XXnHBDKs/lib/python3.9/site-packages/beartype/_decor/decormain.py", line 24, in <module>
    from beartype._data.datatyping import (
  File "/Users/yc/.local/share/virtualenvs/recsys-dataset-XXnHBDKs/lib/python3.9/site-packages/beartype/_data/datatyping.py", line 129, in <module>
    BeartypeReturn = Union[BeartypeableT, BeartypeConfedDecorator]
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 243, in inner
    return func(*args, **kwds)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 316, in __getitem__
    return self._getitem(self, parameters)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 421, in Union
    parameters = _remove_dups_flatten(parameters)
  File "/usr/local/Cellar/python@3.9/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 215, in _remove_dups_flatten
    all_params = set(params)
TypeError: unhashable type: 'list'

can't create train/test split

Hey, thanks for the great dataset!

When I run pipenv run python -m src.testset --train-set train.jsonl --days 2 --output-path 'out/' --seed 42 as specified in the README, I get: ModuleNotFoundError: No module named 'pandas'

Edit: I'm on python 3.10.6
