
SRS Benchmark

Introduction

Spaced repetition algorithms are computer programs designed to help people schedule reviews of flashcards. A good spaced repetition algorithm helps you remember things more efficiently. Instead of cramming all at once, it distributes your reviews over time. To make this efficient, these algorithms try to understand how your memory works. They aim to predict when you're likely to forget something, so they can schedule a review accordingly.

This benchmark is a tool designed to assess the predictive accuracy of various algorithms. A multitude of algorithms are evaluated to find out which ones provide the most accurate predictions.

Dataset

The dataset for the SRS benchmark comes from 20 thousand people who use Anki, a flashcard app. In total, this dataset contains information about ~1.7 billion reviews of flashcards. The full dataset is hosted on Hugging Face Datasets: open-spaced-repetition/FSRS-Anki-20k.

Evaluation

Data Split

In the SRS benchmark, we use TimeSeriesSplit from the scikit-learn machine-learning library. It splits the data by time: older reviews are used for training and newer reviews for testing. That way, we don't accidentally cheat by giving the algorithm future information it shouldn't have. In practice, we use past study sessions to predict future ones, which makes TimeSeriesSplit a good fit for our benchmark.

Note: the first split is excluded from evaluation because it is used only for training, and we don't want to evaluate an algorithm on the same data it was trained on.
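As a minimal illustration of this kind of time-ordered split (a sketch, not the benchmark's actual code), here is how scikit-learn's TimeSeriesSplit behaves on a toy sequence:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

reviews = np.arange(100)  # stand-in for one user's chronologically sorted reviews
tscv = TimeSeriesSplit(n_splits=5)

for i, (train_index, test_index) in enumerate(tscv.split(reviews)):
    # every training window ends before the corresponding test window begins,
    # so the model never sees "future" reviews during training
    print(f"split {i}: train ends at {train_index[-1]}, test covers {test_index[0]}..{test_index[-1]}")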

Metrics

We use three metrics in the SRS benchmark to evaluate how well these algorithms work: log loss, AUC, and a custom RMSE that we call RMSE (bins).

  • Log Loss (also known as binary cross-entropy): primarily used in binary classification problems, log loss measures the discrepancy between the predicted probabilities of recall and the actual review outcomes (1 or 0). It quantifies how well the algorithm approximates the true probability of recall, making it an important metric for evaluating spaced repetition algorithms. Log loss ranges from 0 to infinity, lower is better.
  • AUC (Area under the ROC Curve): AUC measures separability, i.e., how well the algorithm can distinguish between the two classes (recalled vs. forgotten). AUC ranges from 0 to 1; in practice, it is almost always greater than 0.5. Higher is better.
  • Root Mean Square Error in Bins (RMSE (bins)): a metric designed specifically for the SRS benchmark. Predictions and review outcomes are grouped into bins based on three features: the interval length, the number of reviews, and the number of lapses. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These squared differences are weighted by the sample size of each bin, and the weighted root mean square error is then computed. This provides a more nuanced picture of calibration across different probability ranges; a toy computation of all three metrics is sketched after this list. For more details, you can read The Metric. RMSE (bins) ranges from 0 to 1, lower is better.
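Below is a rough, self-contained sketch of the three metrics on toy arrays. The RMSE (bins) part is simplified: it bins only by interval length, whereas the benchmark also bins by the number of reviews and lapses.

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0])                      # review outcomes
p_recall = np.array([0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.95, 0.5])   # predicted probabilities
interval = np.array([1, 1, 3, 3, 7, 7, 15, 15])                  # days since last review

print("Log loss:", log_loss(y_true, p_recall))
print("AUC:", roc_auc_score(y_true, p_recall))

# RMSE (bins): compare the mean prediction with the mean outcome inside each bin,
# weighting each bin by its size
se, n = 0.0, 0
for ivl in np.unique(interval):
    mask = interval == ivl
    se += mask.sum() * (p_recall[mask].mean() - y_true[mask].mean()) ** 2
    n += mask.sum()
print("RMSE (bins), simplified:", np.sqrt(se / n))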

Algorithms

  • FSRS v3: the first version of the FSRS algorithm that people actually used.
  • FSRS v4: the upgraded version of FSRS, made better with help from the community.
  • FSRS-4.5: a minor improvement over FSRS v4; the shape of the forgetting curve has been changed (a sketch of that curve appears after this list). This benchmark also includes FSRS-4.5 with default parameters (obtained by running FSRS-4.5 on all 20 thousand collections) and FSRS-4.5 where only the first 4 parameters (the values of initial stability after the first review) are optimized and the rest are kept at their defaults.
  • FSRS-5: the latest version of FSRS. Unlike the previous versions, it takes into account same-day reviews. Same-day reviews are used only for training, and not for evaluation.
  • FSRS-rs: the Rust port of FSRS-5. See also: https://github.com/open-spaced-repetition/fsrs-rs
  • GRU: a type of neural network that's often used for making predictions based on a sequence of data. It's a classic in the field of machine learning for time-related tasks.
    • GRU-P: a variant of GRU that removes the forgetting curve and predicts the probability of recall directly.
    • GRU-P-short: same as above, but it also takes into account same-day reviews.
  • DASH: the model proposed in this paper. The name stands for Difficulty, Ability, and Study History. In our benchmark, we only use the Ability and Study History because the Difficulty part is not applicable to our dataset. We also added two other variants of this model: DASH[MCM] and DASH[ACT-R]. For further information, please refer to this paper.
  • ACT-R: the model proposed in this paper. It includes an activation-based system of declarative memory. It explains the spacing effect by the activation of memory traces.
  • HLR: the model proposed by Duolingo. Its full name is Half-Life Regression. For further information, please refer to this paper.
  • Transformer: a type of neural network that has gained popularity in recent years due to its superior performance in natural language processing. ChatGPT is based on this architecture.
  • SM-2: one of the early algorithms used by SuperMemo, the first spaced repetition software. It was developed more than 30 years ago, and it's still popular today: Anki's default algorithm is based on SM-2, and Mnemosyne uses it as well. This algorithm does not natively predict the probability of recall; therefore, for the sake of the benchmark, its output was converted into a probability based on some assumptions about the forgetting curve.
  • NN-17: a neural network approximation of SM-17. It has a comparable number of parameters, and according to our estimates, it performs similarly to SM-17.
  • AVG: an "algorithm" that outputs a constant equal to the user's average retention. Has no practical applications and is intended only to serve as a baseline.
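As a rough illustration of the curve-shape change mentioned in the FSRS-4.5 entry, here is a sketch of the power forgetting curve commonly documented for FSRS-4.5/FSRS-5. The constants are chosen so that retrievability equals 90% when the elapsed time equals the stability; this is a sketch, not necessarily the exact implementation used in this repository.

# Sketch of an FSRS-4.5/FSRS-5-style power forgetting curve:
# R(t, S) = (1 + FACTOR * t / S) ** DECAY, with DECAY = -0.5 and FACTOR = 19/81,
# so that R(S, S) = 0.9.
DECAY = -0.5
FACTOR = 19 / 81

def retrievability(elapsed_days: float, stability: float) -> float:
    return (1 + FACTOR * elapsed_days / stability) ** DECAY

print(retrievability(10, 10))  # ~0.9: after `stability` days, recall probability is 90%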

For further information regarding the FSRS algorithm, please refer to the following wiki page: The Algorithm.

Result

Total number of users: 19,990.

Total number of reviews for evaluation: 702,721,850. Same-day reviews are excluded except in FSRS-5 and GRU-P-short, i.e., each algorithm uses only one review per day (the first, chronologically). Some reviews are filtered out, for example, the revlog entries created by changing the due date manually or reviewing cards in a filtered deck with "Reschedule cards based on my answers in this deck" disabled. Finally, an outlier filter is applied. These are the reasons why the number of reviews used for evaluation is significantly lower than the figure of 1.7 billion mentioned earlier.

The following tables present the means and the 99% confidence intervals. The best result is highlighted in bold. The #Params column shows the number of optimizable (trainable) parameters; if a parameter is a constant, it is not counted.

Weighted by the number of reviews

Model #Params LogLoss RMSE(bins) AUC
GRU-P-short 297 0.313±0.0051 0.0420±0.00085 0.707±0.0029
GRU-P 297 0.318±0.0053 0.0435±0.00091 0.697±0.0027
FSRS-5 19 0.320±0.0052 0.050±0.0010 0.700±0.0028
FSRS-rs 19 0.322±0.0053 0.052±0.0011 0.692±0.0029
FSRS-4.5 17 0.324±0.0053 0.052±0.0011 0.693±0.0027
FSRSv4 17 0.329±0.0056 0.057±0.0012 0.690±0.0027
DASH 9 0.331±0.0053 0.060±0.0010 0.642±0.0031
DASH[MCM] 9 0.331±0.0054 0.062±0.0011 0.644±0.0030
DASH[ACT-R] 5 0.334±0.0055 0.065±0.0012 0.632±0.0031
FSRSv3 13 0.360±0.0068 0.070±0.0015 0.667±0.0029
FSRS-5-pretrain 4 0.338±0.0058 0.072±0.0016 0.685±0.0028
NN-17 39 0.346±0.0069 0.075±0.0015 0.595±0.0031
GRU 39 0.375±0.0072 0.079±0.0016 0.658±0.0027
FSRS-5-dry-run 0 0.346±0.0060 0.081±0.0017 0.681±0.0028
ACT-R 5 0.354±0.0057 0.084±0.0019 0.536±0.0030
AVG 0 0.354±0.0059 0.085±0.0019 0.508±0.0029
HLR 3 0.404±0.0079 0.102±0.0020 0.632±0.0034
SM2 0 0.54±0.012 0.147±0.0029 0.599±0.0031
Transformer 127 0.51±0.011 0.182±0.0033 0.515±0.0043

Unweighted

Model #Params LogLoss RMSE(bins) AUC
GRU-P 297 0.345±0.0030 0.0655±0.00082 0.679±0.0017
GRU-P-short 297 0.340±0.0030 0.0659±0.00084 0.687±0.0019
FSRS-5 19 0.346±0.0031 0.0712±0.00084 0.697±0.0017
FSRS-rs 19 0.348±0.0031 0.0726±0.00086 0.693±0.0017
FSRS-4.5 17 0.352±0.0032 0.0742±0.00088 0.688±0.0017
DASH 9 0.358±0.0031 0.0810±0.00095 0.632±0.0018
FSRSv4 17 0.362±0.0033 0.082±0.0010 0.685±0.0016
DASH[MCM] 9 0.358±0.0030 0.0831±0.00094 0.636±0.0019
DASH[ACT-R] 5 0.362±0.0033 0.086±0.0011 0.627±0.0019
FSRS-5-pretrain 4 0.359±0.0032 0.0866±0.00091 0.692±0.0016
NN-17 39 0.380±0.0035 0.100±0.0013 0.570±0.0018
FSRS-5-dry-run 0 0.373±0.0032 0.101±0.0011 0.691±0.0016
AVG 0 0.385±0.0036 0.101±0.0011 0.500±0.0018
ACT-R 5 0.395±0.0040 0.106±0.0012 0.524±0.0018
FSRSv3 13 0.422±0.0046 0.106±0.0013 0.661±0.0017
GRU 39 0.440±0.0052 0.108±0.0013 0.650±0.0017
HLR 3 0.456±0.0051 0.124±0.0013 0.636±0.0018
Transformer 127 0.554±0.0064 0.185±0.0018 0.527±0.0021
SM2 0 0.71±0.013 0.199±0.0021 0.604±0.0018

Averages weighted by the number of reviews are more representative of "best case" performance when plenty of data is available. Since almost all algorithms perform better when there's a lot of data to learn from, weighting by n(reviews) biases the average towards lower values.

Unweighted averages are more representative of "average case" performance. In reality, not every user will have hundreds of thousands of reviews, so the algorithm won't always be able to reach its full potential.

The image below shows the p-values obtained by running the Wilcoxon signed-rank test on the RMSE of all pairs of algorithms. Red means that the row algorithm performs worse than the corresponding column algorithm, and green means that the row algorithm performs better than the corresponding column algorithm. Grey means that the p-value is >0.01, and we cannot conclude that one algorithm performs better than the other.

Almost all p-values are extremely small, many orders of magnitude smaller than 0.01. Of course, p-values this low beg the question, "Can we even trust these values?". scipy.stats.wilcoxon itself uses an approximation for n>50, and our modified implementation uses an approximation to return the decimal logarithm of the p-value rather than the p-value itself, to avoid the limitations of 64-bit floating point numbers. So it's an approximation of an approximation. But more importantly, this test is not weighted, meaning that it doesn't take into account the fact that RMSE depends on the number of reviews.

Overall, these p-values can be trusted on a qualitative (but not quantitative) level.

Wilcoxon, 19990 collections
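For reference, a plain (unmodified) version of the test can be run with scipy on paired per-collection RMSE values; the benchmark's modified implementation instead returns the decimal logarithm of the p-value, so this is only a sketch with toy numbers:

from scipy.stats import wilcoxon

rmse_algo_a = [0.050, 0.061, 0.043, 0.072, 0.055]  # toy per-collection RMSE values
rmse_algo_b = [0.055, 0.064, 0.047, 0.075, 0.058]

statistic, p_value = wilcoxon(rmse_algo_a, rmse_algo_b)
print(statistic, p_value)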

Default Parameters

FSRS-5:

0.4197, 1.1869, 3.0412, 15.2441,
7.1434, 0.6477, 1.0007, 0.0674,
1.6597, 0.1712, 1.1178,
2.0225, 0.0904, 0.3025, 2.1214,
0.2498, 2.9466,
0.4891, 0.6468

Comparisons with SuperMemo 15/16/17

Please refer to the following repositories:

How to run the benchmark

Requirements

Dataset (tiny): #28 (comment)

Dependencies:

pip install -r requirements.txt

Commands

FSRS-4.5:

python script.py

FSRS-4.5 with default parameters:

DRY_RUN=1 python script.py

FSRS-rs:

FSRS_RS=1 FSRS_NO_OUTLIER=1 PYTHONPATH=~/Codes/anki/out/pylib:~/Codes/anki/pylib python script.py

Please change the PYTHONPATH variable to the path of your Anki source code.

FSRSv4/FSRSv3/HLR/LSTM/SM2:

MODEL=FSRSv4 python other.py

Please change the MODEL variable to FSRSv3, HLR, GRU, or SM2 to run the corresponding model.

Dev model in fsrs-optimizer:

DEV_MODE=1 python script.py

Please place the fsrs-optimizer repository in the same directory as this repository.

Contributors

dae, expertium, l-m-sherlock, rlzhi


srs-benchmark's Issues

[Feature request] Add the ACT-R model (see paper)

Relevant paper.
The main formulas are given in the paper (images not reproduced here).
The appendix has example calculations.

It only has 4 parameters: decay intercept (a), decay scale (c), activation threshold (𝜏) and noise (s). Seems to be fairly straightforward to implement. In the paper 𝜏 and s are fixed, but I think we should make them optimizable. I don't see why they should be fixed.
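For context, here is a heavily simplified sketch of ACT-R's declarative-memory recall probability: base-level activation m = ln(Σ t_i^(-d)) over the ages t_i of past reviews, then P(recall) = 1 / (1 + exp((τ - m) / s)). The decay d is held constant here; the paper derives it per trace from the decay intercept a and decay scale c, which this sketch omits, and the parameter values below are only illustrative.

import math

def actr_p_recall(ages_in_seconds, d=0.5, tau=-0.7, s=0.25):
    # base-level activation from the decayed traces of all past reviews
    m = math.log(sum(t ** -d for t in ages_in_seconds))
    # logistic retrieval probability given threshold tau and noise s
    return 1.0 / (1.0 + math.exp((tau - m) / s))

# e.g. three past reviews: 1 hour, 1 day and 1 week ago
print(actr_p_recall([3600, 86400, 604800]))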
Btw, a reminder: please add the Transformer to the table with p-values.

[Question] A “raw” version of the tiny_dataset.zip

In #28 a link to a pre-processed small dataset was shared.

While testing different ways of converting review logs of different spacing algorithms to FSRS, my evaluation on ~7000 reviews generated using an Emacs Lisp implementation of py-fsrs suggests that updating the difficulty and stability for reviews with an interval greater than 1 day is slightly better than using the (re)learning/review states of the py-fsrs implementation.

To make sure I didn't make any mistakes in my evaluation code and to test on larger datasets, I'd like to retry this experiment using the code and datasets of this benchmark, but I can't do so with the “tiny_dataset.zip” because the delta_t values have been rounded to days.

Would it be possible to get access to a similar dataset, either in an unprocessed format or with floating-point delta_t values?

This seems to be related to a difference between how the benchmark and the optimizer implement the FSRS algorithm (using the first review of each day, as I understand it) and how it's implemented in e.g. py-fsrs (using states to decide when to update the parameters). I'm not sure how to compare the two approaches other than by using review logs from FSRS and testing whether the recall prediction would have been more accurate if we had included reviews that occurred in the (re)learning state but after a sufficiently large interval or on a different day.

[Feature request] A quantitative measure of cheating

I have an idea for how to measure the degree of "cheatiness" of an algorithm.

  1. Do the same procedure that you do for plotting the calibration graph.

  2. Record the number of values in the densest bin, aka the highest bar (example image not reproduced here).

  3. Divide it by the total number of reviews. For a cheating algorithm, this will be 100% since there is only one bin, so 100% of reviews fall into that bin.

  4. Do this for every user for a given algorithm.

  5. Calculate the (unweighted) average.

From a theoretical point of view, the issue is that the cutoff will be arbitrary. If the average is 90%, meaning that on average 90% of predicted R values fall within the same bin, is it cheating? What about 80%? Or 50%?
From a practical point of view, this will require re-running every single algorithm since this information cannot be obtained from .json result files right now. At the very least, you will have to re-run FSRS-4.5, ACT-R and DASH[ACT-R], since we are sure that FSRS-4.5 isn't cheating, and ACT-R algorithms are the main suspects. But of course, to get a better idea of what values of this metric are good and what values are bad, you need to re-run the entire benchmark.
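A sketch of steps 1-3 for a single user, assuming the same equal-width bins as the calibration graph (the bin count here is an assumption):

import numpy as np

def densest_bin_share(p_recall, n_bins=20):
    counts, _ = np.histogram(p_recall, bins=n_bins, range=(0.0, 1.0))
    return counts.max() / counts.sum()

# a constant-output ("cheating") algorithm puts 100% of its predictions in one bin
print(densest_bin_share(np.full(1000, 0.9)))               # -> 1.0
print(densest_bin_share(np.random.uniform(0, 1, 1000)))    # -> much lower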

Also, this is not intended to be included in the readme. It's for our internal testing.

Inconsistency in the description

The description says that there are 72 collections and 6240084 reviews, but the actual dataset has 71 collections and 4632965 reviews.

[TODO] Add DASH and its variants


Jones, M. N. (Ed.). (2016). Predicting and Improving Memory Retention: Psychological Theory Matters in the Big Data Era. In Big Data in Cognitive Science (0 ed., pp. 43–73). Psychology Press. https://doi.org/10.4324/9781315413570-8


Randazzo, Giacomo. (2020-21). Memory Models for Spaced Repetition Systems (Tesi di Laurea Magistrale in Mathematical Engineering - Ingegneria Matematica, Politecnico di Milano). Advisor: Marco D. Santambrogio. Retrieved from https://hdl.handle.net/10589/186407

Add MCC

I'll copy what I said in Discord

Another request: calculate the Matthews correlation coefficient: https://en.wikipedia.org/wiki/Phi_coefficient

  1. Select 0.5 as the threshold, and convert all predicted probabilities to binary values:

import numpy as np

# vectorized thresholding instead of a per-element loop
y_pred = (np.asarray(y_pred) > 0.5).astype(int)
  2. Calculate the confusion matrix, and obtain the four values from it:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
  3. Calculate MCC:
import math

def mcc(tn, fp, fn, tp):
    sums = [tp + fp, tp + fn, tn + fp, tn + fn]
    n_zero = 0
    for i in range(4):
        if sums[i] == 0:
            n_zero += 1

    if n_zero == 0:
        x = sums[0] * sums[1] * sums[2] * sums[3]  # I tried using np.prod, but it outputs negative values sometimes, probably due to an overflow
        # I also have to use math.sqrt, because np.sqrt doesn't work for very large numbers
        return ((tp * tn) - (fp * fn)) / math.sqrt(x)
    # if one of the four sums in the denominator is zero, return 0
    elif n_zero == 1:
        return 0
    # if two of the four sums are zero, return 1 or -1, depending on TP, TN, FP and FN
    # https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7
    elif n_zero == 2:
        if tp != 0 or tn != 0:
            return 1
        elif fp != 0 or fn != 0:
            return -1
    # if more than two sums are zero, return None
    elif n_zero > 2:
        return None

I linked 2 articles about MCC (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9938573 and https://ieeexplore.ieee.org/abstract/document/9440903). It has an advantage over AUC - it takes into account all four numbers (true positives, true negatives, false positives, false negatives), whereas AUC only takes into account two.

So with AUC and MCC added, we will have 2 calibration metrics and 2 classification metrics, which is more than enough

Ebisu?

Thank you for this very interesting analysis! If you all feel inclined to include it, I'd be curious to see how Ebisu compares.

[Feature Request] Group users into single dataset

Currently, each user is treated as an isolated dataset, and the results are aggregated with weighted averages under various schemes.

It would be helpful to allow models that learn from one user and can apply that knowledge to others, even without card data. A simple one would just take the current FSRS or LSTM benchmark and apply regularization to the parameters relative to the individual mean (or the mean can be its own parameter); a sketch follows below.
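A tiny sketch of that idea; the penalty form and names are my assumption, not an existing implementation:

import numpy as np

def regularized_loss(base_loss, user_params, population_mean, lam=0.1):
    # pull one user's parameters toward the mean across users, so that users
    # with little data stay close to a shared prior
    user_params = np.asarray(user_params, dtype=float)
    population_mean = np.asarray(population_mean, dtype=float)
    return base_loss + lam * np.sum((user_params - population_mean) ** 2)

print(regularized_loss(0.35, [0.5, 1.2, 3.1], [0.4, 1.2, 3.0]))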

[Feature request] Add confidence intervals for all metrics

Calculating the confidence interval is very easy if we assume that the values of metrics (RMSE, logloss, etc.) are distributed normally. I really hope they are, otherwise this will get complicated.
CI = const * st.dev. / sqrt(N)
For a 99% confidence interval, the constant is 2.576. sqrt(N) is just the square root of the number of collections, and st. dev is the standard deviation of the selected metric.
So all you need is just RMSE and logloss values for every collection, which you, of course, have. There is one tricky part, however. Since we are using a weighted average (with the number of reviews as a weight), the standard deviation also has to be weighted. I found a short and elegant solution after a little bit of googling:
np.sqrt(np.cov(values, aweights=weights))
Here values would be logloss or RMSE of each collection, and weights would be the number of reviews of each collection.
This should make our benchmark more rigorous as well as give us an idea of whether some algorithm actually performs better than another or whether it's just a fluke.
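A sketch of that calculation with toy numbers; `values` are per-collection metric values and `weights` are the corresponding review counts:

import numpy as np

def weighted_ci99(values, weights):
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    mean = np.average(values, weights=weights)
    std = np.sqrt(np.cov(values, aweights=weights))  # weighted standard deviation
    return mean, 2.576 * std / np.sqrt(len(values))

rmse = [0.05, 0.07, 0.04, 0.09]        # toy per-collection RMSE values
n_reviews = [12000, 3000, 25000, 800]  # toy per-collection review counts
print(weighted_ci99(rmse, n_reviews))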

[Question] Some more details from a ML perspective

Hey, first of all thank you for the dataset. I was wondering if you could provide some more details for people who want to train their own ML algorithms but might not be familiar with the internals of Anki.

Let me see if I understand the dataset correctly:

  • The dataset consists of n * m time series where n are the number of users and m are the number of cards.
  • Each time series has 3 features: the number of days since the last review, the rating (1-4), and the current number of reviews

What is the target variable? And how are you handling the time series for RNNs?
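For orientation, here is a sketch of how one user's file could be grouped into per-card sequences, assuming the column layout that appears elsewhere in this benchmark (card_id, review_th, delta_t, rating); the file name is hypothetical, and the target choice is only a guess:

import pandas as pd

df = pd.read_csv("10054.csv")      # hypothetical per-user file
df = df.sort_values("review_th")   # review_th orders the reviews chronologically
sequences = {
    card_id: list(zip(group["delta_t"], group["rating"]))
    for card_id, group in df.groupby("card_id")
}
# a natural target is the binary outcome of the next review: rating == 1 -> 0 (forgot), else 1 (recalled)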

[Feature Request] Train a gradient-boosted decision tree

Although transformers are probably what would give the best performance with enough training and tweaking of hyperparameters, I suspect that a gradient-boosted decision tree ensemble might outperform FSRS with very little tweaking, using a methodology similar to this: https://machinelearningmastery.com/xgboost-for-time-series-forecasting/. It would, however, be a much heavier model with many more parameters than even the LSTM that was attempted.

This is something I'd be interested in exploring if I could have access to the training data.
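A sketch of the kind of tabular framing this suggests: hand-crafted per-review features fed to gradient-boosted trees that predict recall. The feature names are illustrative, not the benchmark's actual feature set:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 3))                  # e.g. [days since last review, #reviews, #lapses]
y = (rng.random(1000) < 0.8).astype(int)   # toy recall outcomes

model = HistGradientBoostingClassifier(max_iter=200)
model.fit(X, y)
p_recall = model.predict_proba(X)[:, 1]
print(p_recall[:5])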

Cannot download dataset from huggingface

from datasets import load_dataset

raw_datasets = load_dataset("open-spaced-repetition/FSRS-Anki-20k")

produces the following error:

Generating train split: 720284748 examples [05:41, 2111906.74 examples/s]
Traceback (most recent call last):
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 2011, in _prepare_split_single
    writer.write_table(table)
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/arrow_writer.py", line 585, in write_table
    pa_table = table_cast(pa_table, self._schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/table.py", line 2295, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/table.py", line 2249, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
card_id: null
review_th: null
delta_t: null
rating: null
__index_level_0__: null
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 780
to
{'card_id': Value(dtype='int64', id=None), 'review_th': Value(dtype='int64', id=None), 'delta_t': Value(dtype='int64', id=None), 'rating': Value(dtype='int64', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:


Traceback (most recent call last):
  File "/home/nieradzik/anki/download.py", line 3, in <module>
    raw_datasets = load_dataset("open-spaced-repetition/FSRS-Anki-20k")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1882, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
    raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 1 new columns ({'__index_level_0__'})

This happened while the csv dataset builder was generating data using

hf://datasets/open-spaced-repetition/FSRS-Anki-20k/dataset/2/10054.csv (at revision 9440578f519d7113db474c284bba7828fcbeccaf)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)

Revlogs parsing

revlogs2dataset.zip
Here are the stats_pb2.py and revlogs2dataset.py
Also, here are the 10 revlogs.
10.zip
For file 1, I expected this result from revlog 1:

card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3
2,5,-1,1
2,7,0,1
2,10,0,3
2,17,0,3
2,101,2,4
2,158,4,3
2,243,1,2
2,352,7,4
2,384,4,2

but got this result:
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
2,4999,-1,1
2,5001,0,1
2,5004,0,3
2,5011,0,3
2,5266,1,4
2,5841,4,3
2,6111,2,2
2,6383,7,4
2,6800,4,2

Updating the benchmark with new data

open-spaced-repetition/fsrs4anki#493 (comment)
Now that Dae has provided us with far more data, I'd like you to update the benchmark repo and include the following algorithms:

  1. FSRS v4.5 (after we figure out whether the extra parameters are worth using or not)
  2. FSRS v4
  3. FSRS v4 (dry run)
  4. FSRS v3
  5. LSTM
  6. HLR
  7. SM-2
  8. Ebisu (once you figure out the details with the dev)

LSTM (short-term) is optional IMO. I'm planning to make a Reddit post, and I will need those 8 algorithms. Since benchmarking will take a lot more CPU time now, I can also help you speed it up by doing some of it myself, if you want me to. Also, please make the dataset downloadable using the python download_data.py command; I want to re-do my test of button usage and RMSE.

Add Memrise to the comparison

It's not very important, but I do want to see how well (or, rather, how poorly) it performs.
https://memrise.zendesk.com/hc/en-us/articles/360015889057-How-does-the-spaced-repetition-system-work-
Next review in: 4 hours > 12 hours > 24 hours > 6 days > 12 days > 48 days > 96 days > 180 days.
My ancient Python code:

import numpy as np

def memrise(history):
    # Memrise ladder, collapsed to whole days (the sub-day steps become 1 day)
    intervals = [1, 6, 12, 24, 48, 96, 180]
    ivl = 0
    reps = 0
    for delta_t, rating in history:
        delta_t = delta_t.item()
        rating = rating.item() + 1
        if rating > 1:
            # success: move one step up the ladder, capped at the last interval
            reps = min(reps + 1, len(intervals))
            ivl = intervals[reps - 1]
        else:
            # lapse: reset to the first step
            ivl = 1
            reps = 1
    return ivl

dataset['memrise_interval'] = dataset['tensor'].map(memrise)
dataset['memrise_p'] = np.exp(np.log(0.9) * dataset['delta_t'] / dataset['memrise_interval'])

The obvious issue is that it's unclear what retention level it aims for. I guess we should use 90%. I tried searching for anything that could give me a hint, but found nothing. Btw, what about HLR? I don't know what % it aims for.
As for the description, you can add something like: "Memrise: the algorithm used by the language-learning app Memrise."

Using the mode to find the best default parameters

This will be the new issue to discuss this, as I was polluting the previous issue.
@L-M-Sherlock here's some interesting data from my preliminary testing; with sqrt(n) in pretrain, 107 489 380 reviews from 2988 users. All three estimates (half-range, half-sample, kernel density) seem to agree with one another in most, but not all cases. Below are the values of modes obtained by these three estimators after sorting the values:

[0.3534, 0.3601, 0.4564]
[0.9, 0.9, 0.903]
[2.3, 2.5091, 2.5861]
[10.9, 10.9, 365.0]
[5.0375, 5.049, 5.0491]
[1.0596, 1.0597, 1.0598]
[0.7406, 0.8595, 0.86]
[0.0, 0.0, 0.0]
[1.49, 1.49, 1.49]
[0.1, 0.1, 0.1]
[0.94, 0.94, 0.9405]
[2.1257, 2.18, 2.1803]
[0.01, 0.01, 0.01]
[0.3399, 0.34, 0.34]
[1.26, 1.26, 2.0]
[0.0, 0.0, 0.0]
[2.61, 2.61, 4.0]

In order to calculate the final value, I use all three estimators (HRM, HSK, KDE), and then take the average of the two closest ones. This isn't used in the data above, just to clarify.
S0 for "Easy" is the problem. It's either the default value of 10.9, or the max. value, so modes don't help in this case.
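A sketch of the aggregation rule described above: out of the three mode estimates, average the two that are closest to each other.

from itertools import combinations

def combine_modes(estimates):
    # pick the pair of estimates with the smallest gap and return their mean
    a, b = min(combinations(estimates, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2

print(combine_modes([0.3534, 0.3601, 0.4564]))  # -> 0.35675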

[Feature Request] Add a Transformer

This isn't high priority at all, of course, but I would like to see how well a Transformer neural network can perform.
Slightly unrelated, but please add the number of parameters used by LSTM to the description.
Also, ideally, the number of parameters in LSTM and in the Transformer should be roughly the same to make the comparison fair and clearly see how much the difference in architecture matters.

[Feature Request] Add a BiLSTM

Here's my code:
biLSTM.zip

It's basically other.py, but you don't need to specify the model; just run set DEV_MODE=1 && python biLSTM.py. I removed the other algorithms and changed the LSTM class. The problem is that I don't get any errors, but I don't get any output files either. You said that it could be a RAM issue, but I tried setting n_hidden to 1 and still got no output. Since there are no errors, I find it hard to debug. So I want you to do 2 things:

  1. Find out what's the problem with my code
  2. Benchmark the bi-directional LSTM and add it to the table

Inclusion of any of the boosting models

Would it be possible to include any of the boosting models, such as CatBoost, LightGBM, or XGBoost? They are very good with tabular time-series data such as this, and neural networks are already included in the comparison.
