maciejkula / spotlight

Deep recommender models using PyTorch.

License: MIT License

Python 97.73% Shell 2.27%

recommender-system deep-learning learning-to-rank python machine-learning matrix-factorization pytorch

spotlight's Issues

I ran this command but met problems

I ran this command: conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.4
but it showed the following error:
conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.3
Fetching package metadata ...........
Solving package specifications:

PackageNotFoundError: Packages missing in current channels:

spotlight 0.1.3* -> pytorch 0.3.0 -> mkl >=2018

Segmentation Fault with Pandas

I ran into a very odd segmentation fault error. This could very well be a PyTorch bug, but I thought I'd bring it up here, first. I've produced a minimal example at the bottom of this issue.

So far, I know that the fault happens at the loss.backward() call in model.fit(). The fault only seems to happen under the combination of two conditions (that I can find, so far):

When sparse=True.
Pandas is imported at the top of the file

(BTW, I pass in an SGD optimizer because that seems to be the only one that works right now with sparse embeddings)

I'm using pandas version 0.20.3 from conda, the latest spotlight from master, and PyTorch 0.2.0 from conda. I'd love to know if others can reproduce this.

As I said, this could very well be a PyTorch bug, but, if others run into this, it'll be helpful to have this issue as a reference.

import pandas as pd
import numpy as np
import torch

from spotlight.interactions import Interactions
from spotlight.factorization.implicit import ImplicitFactorizationModel

user_ids = [2471, 5808, 3281, 4086, 6293, 8970, 11828, 3281]
item_ids = [1583, 57, 6963, 867, 8099, 10991, 24, 800]
num_users = 15274
num_items = 25655

train = Interactions(np.array(user_ids, dtype=np.int64),
                     np.array(item_ids, dtype=np.int64),
                     num_users=num_users,
                     num_items=num_items)

def optimizer_func(params, lr=0.01):
    return torch.optim.SGD(params, lr=lr)
  
RANDOM_STATE = np.random.RandomState(42)
model = ImplicitFactorizationModel(loss='bpr',
                                   embedding_dim=32,
                                   batch_size=4,
                                   n_iter=1,
                                   use_cuda=False,
                                   optimizer_func=optimizer_func,
                                   sparse=True,
                                   random_state=RANDOM_STATE)
# Fault
model.fit(train, verbose=True)

LightFM sample_weight equivalent in Spotlight?

Hi @maciejkula,
Is there a LightFM sample_weight equivalent in Spotlight?
The scenario is there are two types of ratings in the problem:

Explicit (user has actually bought the item)
Implicit (user has searched for item, but not purchased)

I believe this is good cause to utilize sample_weights, but please correct me if I'm wrong.
For folks stumbling upon this, sample weight in LightFM is defined as:

sample_weight: np.float32 coo_matrix of shape [n_users, n_items], optional
     matrix with entries expressing weights of individual
     interactions from the interactions matrix.
     Its row and col arrays must be the same as
     those of the interactions matrix. For memory
     efficiency its possible to use the same arrays
     for both weights and interaction matrices.
     Defaults to weight 1.0 for all interactions.
     Not implemented for the k-OS loss.

Searching Spotlight didn't reveal interaction weights used anywhere except in cross_validation.py like:
weights=_index_or_none(interactions.weights...

Whereas I was hoping to see something more like LightFM's _lightfm_fast.pyx.template:
loss = weight * (prediction - y)

Is there a reason this is missing?
How hard would it be to add to the ImplicitFactorization and ImplicitSequence models?
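
To make the question concrete, here is a minimal sketch of what per-interaction weighting could look like for a pointwise loss. It is plain Python, not Spotlight or LightFM code; the function name `weighted_squared_loss` and the weight values are hypothetical and only illustrate the idea of scaling each interaction's error by its weight.

```python
def weighted_squared_loss(predictions, targets, weights):
    """Weighted mean of squared errors; each weight scales one interaction."""
    if not (len(predictions) == len(targets) == len(weights)):
        raise ValueError("inputs must have the same length")
    total = sum(w * (p - y) ** 2
                for p, y, w in zip(predictions, targets, weights))
    return total / sum(weights)


# Hypothetical data: the first interaction is a purchase (weight 3.0),
# the other two are searches (weight 1.0).
predictions = [0.9, 0.2, 0.6]
targets = [1.0, 1.0, 1.0]
weights = [3.0, 1.0, 1.0]
loss = weighted_squared_loss(predictions, targets, weights)
```

With all weights equal, this reduces to the ordinary mean squared error, which matches the "defaults to weight 1.0" behaviour quoted from the LightFM docstring.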

MRR score uses a "mean" instead of "min"

If I'm not mistaken, MRR score is a mean of reciprocal rank scores over a set of examples, where reciprocal rank is given as 1 / (rank of the first positive item).

The current implementation in Spotlight takes the mean of the reciprocal ranks of all positive items. I think this is not correct. It is also inconsistent with LightFM. It also makes it impossible to get an MRR score of 1 even with a perfect model (even if the positive items are ranked [1, 2, 3], taking the mean gives a result lower than 1).

Can I send a PR to fix this?
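
The difference between the two definitions can be seen on the [1, 2, 3] example from above. This is only an illustrative sketch; the helper names are made up and neither function is taken from Spotlight or LightFM.

```python
def mean_of_reciprocal_ranks(ranks):
    """Average of 1/rank over all positive items (the "mean" variant)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def reciprocal_of_best_rank(ranks):
    """1 / rank of the first (best-ranked) positive item (the "min" variant)."""
    return 1.0 / min(ranks)


positive_ranks = [1, 2, 3]  # a perfect ranking of three positive items
mean_variant = mean_of_reciprocal_ranks(positive_ranks)  # (1 + 1/2 + 1/3) / 3
min_variant = reciprocal_of_best_rank(positive_ranks)    # 1 / 1
```

For this perfect ranking the "min" variant gives 1.0, while the "mean" variant gives 11/18, roughly 0.61.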

Upgrade to PyTorch 0.2.0

How to predict items for a particular user?

As far as I understand, the model can predict a sequence of items given one or more items in a sequence, but how do I get personalized predictions for a given user?

Is it possible to extract the learned user embeddings or item embeddings from the model which can be used to generate a predicted weight by doing a dot product?
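
As a sketch of the dot-product idea in the question: given a user vector and a list of item vectors, scores are the inner products, and the top-k items are the highest-scoring ones. How (or whether) the model exposes its embeddings is not covered here; `score_items`, `top_k` and the vectors below are hypothetical.

```python
def score_items(user_vector, item_vectors, item_biases=None):
    """Score every item for one user via a dot product (plus optional bias)."""
    scores = []
    for index, item_vector in enumerate(item_vectors):
        score = sum(u * v for u, v in zip(user_vector, item_vector))
        if item_biases is not None:
            score += item_biases[index]
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring items, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


# Made-up 2-dimensional embeddings for one user and three items.
user_vector = [1.0, 0.0]
item_vectors = [[0.5, 0.5], [2.0, 0.0], [-1.0, 3.0]]
scores = score_items(user_vector, item_vectors)
best_two = top_k(scores, 2)
```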

Wrong prediction result

I am on the master branch of PyTorch and am also using the master branch of Spotlight.

Test code:

from spotlight.cross_validation import random_train_test_split
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.evaluation import mrr_score
from spotlight.factorization.implicit import ImplicitFactorizationModel

dataset = get_movielens_dataset(variant='1M')

train, test = random_train_test_split(dataset,test_percentage=0.01)
print train
print test
model = ImplicitFactorizationModel(n_iter=1,embedding_dim=32,
                                   loss='bpr', use_cuda=True)
model.fit(train,verbose=True)
mrr = mrr_score(model, test)
print mrr.mean()

Put a breakpoint at mrr_score

def mrr_score(model, test, train=None):
    """
    Compute mean reciprocal rank (MRR) scores. One score
    is given for every user with interactions in the test
    set, representing the mean reciprocal rank of all their
    test items.

    Parameters
    ----------

    model: fitted instance of a recommender model
        The model to evaluate.
    test: :class:`spotlight.interactions.Interactions`
        Test interactions.
    train: :class:`spotlight.interactions.Interactions`, optional
        Train interactions. If supplied, scores of known
        interactions will be set to very low values and so not
        affect the MRR.

    Returns
    -------

    mrr scores: numpy array of shape (num_users,)
        Array of MRR scores for each user in test.
    """

    test = test.tocsr()

    if train is not None:
        train = train.tocsr()

    mrrs = []
    for user_id, row in enumerate(test):
        if not len(row.indices):
            continue

        predictions = -model.predict(user_id)
        import pdb
        pdb.set_trace()
        if train is not None:
            predictions[train[user_id].indices] = FLOAT_MAX

        mrr = (1.0 / st.rankdata(predictions)[row.indices]).mean()

        mrrs.append(mrr)

    return np.array(mrrs)

Let's run and debug

python test_spotlight.py
<Interactions dataset (6041 users x 3953 items x 990206 interactions)>
<Interactions dataset (6041 users x 3953 items x 10003 interactions)>
Epoch 0: loss 0.136451368182
> /home/sunlin/anaconda2/lib/python2.7/site-packages/spotlight/evaluation.py(48)mrr_score()
-> if train is not None:
(Pdb) predictions.shape
(15626209,)
(Pdb)

The correct shape is (3953,), but the actual shape is (15626209,), where 15626209 = 3953 * 3953.

The root cause is in the BilinearNet model. A fix is to change

        return dot + user_bias + item_bias

to

        return dot.view(-1,1) + user_bias + item_bias

Is this to do with my version of pytorch?
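
The n * n blow-up is what NumPy-style broadcasting produces when a vector of shape (n,) is added to a column of shape (n, 1). The sketch below reproduces the shapes with NumPy; that the biases come out as (n, 1) and the dot product as (n,) inside BilinearNet is an assumption based on the output above, not something checked against the source.

```python
import numpy as np

n = 4
dot = np.ones(n)              # shape (n,), assumed shape of the dot product
user_bias = np.ones((n, 1))   # shape (n, 1), assumed shape of the bias lookups
item_bias = np.ones((n, 1))

# (n,) + (n, 1) broadcasts to (n, n): every row is combined with every column.
broken = dot + user_bias + item_bias

# Reshaping the dot product to a column keeps the result at (n, 1).
fixed = dot.reshape(-1, 1) + user_bias + item_bias
```

Flattening `broken` gives n * n values, which matches the 3953 * 3953 entries seen in the debugger.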

Conda Install Issues.

Hi,
I ran

conda install --prefix=venv3 -c maciejkula -c pytorch spotlight=0.1.2

and it returned

Fetching package metadata ...............
Solving package specifications:

PackageNotFoundError: Packages missing in current channels:

  - spotlight 0.1.2* -> pytorch 0.2.0

We have searched for the packages in the following channels:

  - https://conda.anaconda.org/maciejkula/osx-64
  - https://conda.anaconda.org/maciejkula/noarch
  - https://conda.anaconda.org/pytorch/osx-64
  - https://conda.anaconda.org/pytorch/noarch
  - https://repo.continuum.io/pkgs/main/osx-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/osx-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/osx-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/osx-64
  - https://repo.continuum.io/pkgs/pro/noarch

I was able to install it a few days ago with the above command. Wonder if something changed?

I created a new project from scratch, created a new conda environment and ran

conda install -c maciejkula -c pytorch spotlight=0.1.2

and got the same error.

warp loss

Do you have any plans to complete warp loss?

Negative sampling

@maciejkula I guess we should remove the items found in the training dataset before negative sampling. Otherwise, it might make learning less effective?
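
One way to do this is rejection sampling: draw a random item and retry if the user has already interacted with it. The sketch below is a plain-Python illustration with hypothetical names (`sample_negative`, `positives`); it does not describe how Spotlight samples negatives.

```python
import random


def sample_negative(user_id, num_items, positives, rng, max_tries=100):
    """Draw an item id the user has not interacted with (rejection sampling)."""
    seen = positives.get(user_id, set())
    for _ in range(max_tries):
        candidate = rng.randrange(num_items)
        if candidate not in seen:
            return candidate
    raise RuntimeError("could not find a negative item for this user")


# Hypothetical training data: user 0 has interacted with items 0, 1 and 2.
positives = {0: {0, 1, 2}}
rng = random.Random(0)
negatives = [sample_negative(0, 5, positives, rng) for _ in range(50)]
```

The retry loop costs little when each user has seen only a small fraction of the catalogue, which is the usual case.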

Add more metrics

precision@k
recall@k
AUC
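
For reference, precision@k and recall@k for a single user can be sketched as below. This is an illustration only, with a hypothetical function name and made-up data; AUC is not covered.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    precision = hits / float(k)
    recall = hits / float(len(relevant_items)) if relevant_items else 0.0
    return precision, recall


ranked_items = [5, 3, 9, 1, 7]   # model's ranking, best first (made up)
relevant_items = {3, 7, 8}       # held-out test items for this user (made up)
precision_3, recall_3 = precision_recall_at_k(ranked_items, relevant_items, 3)
precision_5, recall_5 = precision_recall_at_k(ranked_items, relevant_items, 5)
```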

Enable GPU test runs

We'd like to be able to run tests on GPU (even if that is only manual for now).

Loading movielens dataset does not work

This URL does not exist: https://github.com/maciejkula/recommender_datasets/releases/download/

Upload to PyPI

It could be useful to upload spotlight to PyPI, particularly since it's a pure Python package and, apart from installing the PyTorch dependency, there is no advantage in installing it from a custom conda channel.

Roadmap: Hybrid Recommender?

Hi,
I was looking at LightFM and saw item and user metadata being used for recommendations. This is really cool. Just wondering if such functionality is in the roadmap for spotlight?

Sequential model improvements

Hi!
Very cool project.

There are some potential improvements to the sequential model described in "Improved Recurrent Neural Networks for Session-based Recommendations".

Randomly dropping items from sequences helps to avoid over-fitting to website structure and improves MRR by ~8% in my experiments with proprietary click-stream data. If done per batch during training, the memory overhead could be avoided.

Faster recurrent units like GRU, QRNN or SRU could strike a better performance/accuracy trade-off than the causal convolution model.
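
The random item dropping mentioned above could look roughly like this when applied to one sequence at training time. It is a plain-Python sketch with a hypothetical `drop_items` helper; it ignores any padding conventions used by Spotlight's sequence representation.

```python
import random


def drop_items(sequence, drop_prob, rng):
    """Randomly drop items from a sequence, preserving order.

    At least one item is always kept so the sequence never becomes empty.
    """
    if not sequence:
        return []
    kept = [item for item in sequence if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(sequence)]


rng = random.Random(0)
sequence = [11, 12, 13, 14, 15, 16]
augmented = drop_items(sequence, drop_prob=0.25, rng=rng)
```

Calling this on each batch during training, rather than precomputing augmented copies, avoids storing extra sequences.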

Serialization and Online learning

I saw that "Add model serialization" is on the Trello to-do list.

If I can serialize the model, can I reload the old model and continue training with the new interactions data? I guess there would be a learning rate problem, at least with the Adam optimizer. What do you do in practice? Can you recommend something to read?

Thank you!

Issue caused by Implicit Sequence Model

Very good project. However, I ran into an issue when trying to make batch predictions with the Implicit Sequence Model.

Here I modified your example code a little bit:

import hashlib
import json
import os
import shutil
import sys

import numpy as np

from sklearn.model_selection import ParameterSampler

from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.cross_validation import user_based_train_test_split
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.sequence.representations import CNNNet
from spotlight.evaluation import sequence_mrr_score

CUDA = True

NUM_SAMPLES = 10

LEARNING_RATES = [1e-3, 1e-2, 5 * 1e-2, 1e-1]
LOSSES = ['bpr', 'hinge', 'adaptive_hinge', 'pointwise']
BATCH_SIZE = [8, 16, 32, 256]
EMBEDDING_DIM = [8, 16, 32, 64, 128, 256]
N_ITER = list(range(5, 20))
L2 = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.0]

class Results:

    def __init__(self, filename):
        self._filename = filename

        # Create the results file if it does not exist yet,
        # closing the handle straight away.
        with open(self._filename, 'a+'):
            pass

    def _hash(self, x):

        return hashlib.md5(json.dumps(x, sort_keys=True).encode('utf-8')).hexdigest()

    def save(self, hyperparams, test_mrr, validation_mrr):

        result = {'test_mrr': test_mrr,
                  'validation_mrr': validation_mrr,
                  'hash': self._hash(hyperparams)}
        result.update(hyperparams)

        with open(self._filename, 'a+') as out:
            out.write(json.dumps(result) + '\n')

    def best(self):

        results = sorted([x for x in self],
                         key=lambda x: -x['test_mrr'])

        if results:
            return results[0]
        else:
            return None

    def __getitem__(self, hyperparams):

        params_hash = self._hash(hyperparams)

        with open(self._filename, 'r+') as fle:
            for line in fle:
                datum = json.loads(line)

                if datum['hash'] == params_hash:
                    del datum['hash']
                    return datum

        raise KeyError

    def __contains__(self, x):

        try:
            self[x]
            return True
        except KeyError:
            return False

    def __iter__(self):

        with open(self._filename, 'r+') as fle:
            for line in fle:
                datum = json.loads(line)

                del datum['hash']

                yield datum


def sample_cnn_hyperparameters(random_state, num):

    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
        'kernel_width': [3, 5, 7],
        'num_layers': list(range(2, 10)),
        'dilation_multiplier': [1, 2],
        'nonlinearity': ['tanh', 'relu'],
        'residual': [True, False]
    }

    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)

    for params in sampler:
        params['dilation'] = list(params['dilation_multiplier'] ** (i % 8)
                                  for i in range(params['num_layers']))

        yield params


def sample_lstm_hyperparameters(random_state, num):

    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
    }

    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)

    for params in sampler:

        yield params


def sample_pooling_hyperparameters(random_state, num):

    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
    }

    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)

    for params in sampler:

        yield params


def evaluate_cnn_model(hyperparameters, train, test, validation, random_state):

    h = hyperparameters

    net = CNNNet(train.num_items,
                 embedding_dim=h['embedding_dim'],
                 kernel_width=h['kernel_width'],
                 dilation=h['dilation'],
                 num_layers=h['num_layers'],
                 nonlinearity=h['nonlinearity'],
                 residual_connections=h['residual'])

    model = ImplicitSequenceModel(loss=h['loss'],
                                  representation=net,
                                  batch_size=h['batch_size'],
                                  learning_rate=h['learning_rate'],
                                  l2=h['l2'],
                                  n_iter=h['n_iter'],
                                  use_cuda=CUDA,
                                  random_state=random_state)

    model.fit(train, verbose=True)

    test_mrr = sequence_mrr_score(model, test)
    val_mrr = sequence_mrr_score(model, validation)

    return model, test_mrr, val_mrr


def evaluate_lstm_model(hyperparameters, train, test, validation, random_state):

    h = hyperparameters

    model = ImplicitSequenceModel(loss=h['loss'],
                                  representation='lstm',
                                  batch_size=h['batch_size'],
                                  learning_rate=h['learning_rate'],
                                  l2=h['l2'],
                                  n_iter=h['n_iter'],
                                  use_cuda=CUDA,
                                  random_state=random_state)

    model.fit(train, verbose=True)

    test_mrr = sequence_mrr_score(model, test)
    val_mrr = sequence_mrr_score(model, validation)

    return model, test_mrr, val_mrr


def evaluate_pooling_model(hyperparameters, train, test, validation, random_state):

    h = hyperparameters

    model = ImplicitSequenceModel(loss=h['loss'],
                                  representation='pooling',
                                  batch_size=h['batch_size'],
                                  learning_rate=h['learning_rate'],
                                  l2=h['l2'],
                                  n_iter=h['n_iter'],
                                  use_cuda=CUDA,
                                  random_state=random_state)

    model.fit(train, verbose=True)

    test_mrr = sequence_mrr_score(model, test)
    val_mrr = sequence_mrr_score(model, validation)

    return model, test_mrr, val_mrr


def run(train, test, validation, random_state, model_type):

    results = Results('{}_results.txt'.format(model_type))

    best_result = results.best()

    if model_type == 'pooling':
        eval_fnc, sample_fnc = (evaluate_pooling_model,
                                sample_pooling_hyperparameters)
    elif model_type == 'cnn':
        eval_fnc, sample_fnc = (evaluate_cnn_model,
                                sample_cnn_hyperparameters)
    elif model_type == 'lstm':
        eval_fnc, sample_fnc = (evaluate_lstm_model,
                                sample_lstm_hyperparameters)
    else:
        raise ValueError('Unknown model type')

    if best_result is not None:
        print('Best {} result: {}'.format(model_type, results.best()))
    
    
    trained_models = []
    for hyperparameters in sample_fnc(random_state, NUM_SAMPLES):

        if hyperparameters in results:
            continue

        print('Evaluating {}'.format(hyperparameters))

        (_model, test_mrr, val_mrr) = eval_fnc(hyperparameters,
                                       train,
                                       test,
                                       validation,
                                       random_state)
        
        trained_models.append(_model)
        
        print('Test MRR {} val MRR {}'.format(
            test_mrr.mean(), val_mrr.mean()
        ))

        results.save(hyperparameters, test_mrr.mean(), val_mrr.mean())

    return results, trained_models

Test training models:

import torch

torch.cuda.empty_cache()

max_sequence_length = 200
min_sequence_length = 20
step_size = 200
random_state = np.random.RandomState(100)

dataset = get_movielens_dataset('1M')

train, rest = user_based_train_test_split(dataset,
                                          random_state=random_state)
test, validation = user_based_train_test_split(rest,
                                               test_percentage=0.5,
                                               random_state=random_state)
train = train.to_sequence(max_sequence_length=max_sequence_length,
                          min_sequence_length=min_sequence_length,
                          step_size=step_size)
test = test.to_sequence(max_sequence_length=max_sequence_length,
                        min_sequence_length=min_sequence_length,
                        step_size=step_size)
validation = validation.to_sequence(max_sequence_length=max_sequence_length,
                                    min_sequence_length=min_sequence_length,
                                    step_size=step_size)
mode = 'lstm'

_results, _trained_models = run(train, test, validation, random_state, mode)

I used this for batch prediction:

_train_seq = _train.to_sequence(max_sequence_length=max_sequence_length,
                          min_sequence_length=min_sequence_length,
                          step_size=step_size)

However, here I saw negative prediction results:

Using 0 model for prediction...
[ 0.          0.00385581  0.02080305 ...,  0.00957463 -0.01620084
 -0.00543251]
Using 1 model for prediction...
[ 0.          4.48851442  2.41166139 ..., -3.30704689 -3.22007656
 -3.52712202]
Using 2 model for prediction...
[ 0.          0.16253436 -0.84116733 ..., -0.94019377 -1.20338225
 -1.54141724]
Using 3 model for prediction...
[ 0.          1.70468175 -0.78328896 ..., -1.90896392 -0.87334442
 -0.26563033]
Using 4 model for prediction...
[ 0.          1.87787497  0.42591834 ..., -0.74543238 -0.91004479
 -1.74974561]
Using 5 model for prediction...
[ 0.         -0.14088905 -1.42516923 ..., -0.74076736 -2.24554992
 -2.00794005]
Using 6 model for prediction...
[  0.          10.63965607   3.06833196 ...,  -8.56642342  -8.88048935
  -9.23009682]
Using 7 model for prediction...
[  0.           9.38321972  -2.44772935 ...,  -7.41398525 -12.05816555
  -3.69266152]
Using 8 model for prediction...
[ 0.          0.26318601  0.11055376 ..., -0.07141581 -0.07582763
 -0.14792749]
Using 9 model for prediction...
[  0.         -17.04875755   5.70450497 ..., -41.66604614 -40.23033524
 -34.51607895]

Is there anything I did wrong for prediction? Not sure why I saw negative predicted values.

Getting multiple predictions seems broken

I'm playing with implicit models using default BilinearNet as representation.

Given an Interactions object named test and a model named model, one would expect

model.predict(test.user_ids)

to work, but it raises

RuntimeError: The expanded size of the tensor (<num_users>) must match the existing size (<...>) at non-singleton dimension 0

I think fixing this would require changing the way spotlight generates predictions. Currently, when we want predictions for user 7 and items [1, 2, 3], we actually call

model._net([7, 7, 7], [1, 2, 3])

To scale this to multiple users, e.g. users [7, 8], we could:

1. Generate a tensor [[7,7,7], [8,8,8]] for user_ids, and call _net() as usual (some unsqueeze on item_embeddings would be needed for broadcasting).
2. Have a BilinearNet.predict_all method that would compute

x = th.LongTensor([7, 8])
y = th.LongTensor([1, 2, 3])

self.user_embeddings(x) @ self.item_embeddings(y).t()

3. Use torch.bmm() which, depending on the shape of the user_ids, computes the equivalent of either 1. or 2.

I believe 2. is the cleanest and should also be the fastest.
@maciejkula what do you think?
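
To illustrate that options 1 and 2 give the same scores, here is a small NumPy sketch (NumPy is used instead of PyTorch only to keep it self-contained). The embedding tables are random stand-ins, and bias terms are left out.

```python
import numpy as np

rng = np.random.RandomState(0)
user_embeddings = rng.randn(10, 4)   # stand-in for the user embedding table
item_embeddings = rng.randn(6, 4)    # stand-in for the item embedding table

user_ids = np.array([7, 8])
item_ids = np.array([1, 2, 3])

# Option 2: one matrix product gives a (num_users, num_items) score matrix.
scores = user_embeddings[user_ids] @ item_embeddings[item_ids].T

# Option 1: repeat the ids so every (user, item) pair is scored row by row.
repeated_users = np.repeat(user_ids, len(item_ids))   # [7, 7, 7, 8, 8, 8]
tiled_items = np.tile(item_ids, len(user_ids))        # [1, 2, 3, 1, 2, 3]
pairwise = (user_embeddings[repeated_users]
            * item_embeddings[tiled_items]).sum(axis=1)
```

The matrix product avoids materialising the repeated id arrays, which is one reason option 2 is likely to be cheaper.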

Install Error

I'm using Python 3.6 in a conda environment and getting the following error:

conda install -c maciejkula -c soumith spotlight=0.1.2
Fetching package metadata .............

PackageNotFoundError: Package missing in current osx-64 channels:

spotlight 0.1.2*

Implement bloom embedding layers

As per https://arxiv.org/pdf/1706.03993.pdf.

Initial work in https://github.com/maciejkula/spotlight/compare/bloom_embeddings?expand=1

besides bpr

Just wondering about the technical hurdles
for WARP loss.

mini batch feed

Hello
Just wondering, is there a way to feed batches from text files into spotlight as a pipeline? (The training data does not fit in RAM.)

Thanks
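
As a sketch of the general idea, independent of whether Spotlight supports it: a generator can read a text file lazily and yield fixed-size mini-batches, so only one batch is in memory at a time. The "user,item" line format and the `iter_minibatches` name are assumptions made for this example.

```python
import io


def iter_minibatches(lines, batch_size):
    """Yield (user_ids, item_ids) batches from lines formatted as 'user,item'."""
    user_ids, item_ids = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item = line.split(",")
        user_ids.append(int(user))
        item_ids.append(int(item))
        if len(user_ids) == batch_size:
            yield user_ids, item_ids
            user_ids, item_ids = [], []
    if user_ids:  # final, possibly smaller batch
        yield user_ids, item_ids


# An in-memory file stands in for a large file on disk.
sample_file = io.StringIO("0,10\n1,11\n2,12\n3,13\n4,14\n")
batches = list(iter_minibatches(sample_file, batch_size=2))
```

With a real file, passing the open file object as `lines` reads it one line at a time.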

Formulation and usage questions

I have a few questions about using Spotlight for an item-item problem involving graded implicit feedback. Pardon me if there is a better forum for such questions; I wasn't able to find one.

I work on a system with feedback in the form of clicks (aka page views), likes and purchases.
In this case obviously a purchase is substantially more desirable than a simple click.

Is there an obvious way to achieve this with Spotlight? Should I treat it as pure implicit and use the weights parameter to assign a greater weight to purchases than clicks?
Or is it more appropriate to treat it as a ratings prediction problem where the "ratings" are really pseudo-ratings assigned by me?

Also, does Spotlight have any support for cold-start? Or support for predicting for a new user in production based on that user's (previously unseen) history of implicit feedback? Or would lightfm maybe be a better fit for all of this?

Finally, if deployed in production can Spotlight models predict at reasonably low latency? Perhaps <100ms?

Thanks very much for Spotlight. It's well-documented and the code is a joy to read.

Reproducing "binge" result

How can I get results similar to those in https://github.com/maciejkula/binge, as reported in your paper?

from spotlight.cross_validation import random_train_test_split
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.evaluation import mrr_score
from spotlight.factorization.implicit import ImplicitFactorizationModel
from spotlight.factorization.explicit import ExplicitFactorizationModel
from torch import optim
dataset = get_movielens_dataset(variant='1M')

train, test = random_train_test_split(dataset,test_percentage=0.2)
print train
print test

model = ImplicitFactorizationModel(n_iter=12,embedding_dim=32,batch_size=2048,
                                learning_rate=0.001,l2=1e-6,loss='adaptive_hinge',use_cuda=True)
model.fit(train,verbose=True)
mrr = mrr_score(model, test)
print mrr.mean()

The result is around 0.035 which is far from 0.07 in the paper. I was using the same hyper-parameters as in "movielens_1M_validation.log" in your binge repository, what am I missing?

Cannot pass in optimizer to models

Because the representations (i.e. classes that inherit from torch.nn.Module) are instantiated inside of the fit method for the models, it is impossible to pass in an instance of a PyTorch optimizer because one would need to already know the module parameters beforehand.

Perhaps one should pass in a string (e.g. 'sgd') and then that gets mapped to PyTorch optimizers? Or, one can pass in a reference to the optimizer class that then gets instantiated with the module parameters.
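
The second suggestion (pass something callable that is instantiated once the parameters exist) can be sketched as follows. `ToySGD` and `ToyModel` are hypothetical stand-ins, not PyTorch or Spotlight classes; the `optimizer_func` argument used in the segmentation fault report earlier in this listing has the same shape.

```python
class ToySGD:
    """Hypothetical stand-in for an optimizer class such as torch.optim.SGD."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr


class ToyModel:
    """Hypothetical model that only creates its parameters inside fit()."""

    def __init__(self, optimizer_func=None):
        self._optimizer_func = optimizer_func
        self._optimizer = None

    def fit(self):
        params = [0.0, 0.0]  # parameters do not exist before fit() is called
        if self._optimizer_func is None:
            self._optimizer = ToySGD(params)          # default optimizer
        else:
            self._optimizer = self._optimizer_func(params)
        return self


# The caller chooses the optimizer and its settings without knowing the params.
model = ToyModel(optimizer_func=lambda params: ToySGD(params, lr=0.1)).fit()
default_model = ToyModel().fit()
```

Passing a callable rather than a string keeps the model open to any optimizer, including ones with custom settings.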

Clarity on documentation

I am using spotlight with the MovieLens dataset to build a simple recommender system. I am finding it hard to understand what some of the functions actually return.

For example -

mrr = mrr_score(model, test)

The documentation says "One score is given for every user with interactions in the test set". What does this score mean? Is it predicting something?

When I print the result, I get

MRR Score
[ 0.04613095  0.01775374  0.02144497 ...,  0.00335534  0.04208428 0.01965703]

Also, what does model.predict(user_id) predict?

Again when I print, I get below results:


User 3
[-21.33734512  22.07717514  12.95882034 ..., -13.99704361 -10.88421345 5.03288126]

Troubleshooting example.py in bloom_embeddings

Hi @maciejkula,

I have encountered two errors when running example.py from bloom_embeddings: a failure to download movielens_100K, and an "out of bounds for int32" error.

Complete error reference:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-27-caa8df7efe1c>", line 1, in <module>
    dataset = get_movielens_dataset(variant='100K')
  File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\movielens.py", line 70, in get_movielens_dataset
    return Interactions(*_get_movielens(url))
  File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\movielens.py", line 37, in _get_movielens
    extension))
  File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\_transport.py", line 36, in get_data
    download(url, dest_path)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\_transport.py", line 19, in download
    req.raise_for_status()
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://github.com/maciejkula/recommender_datasets/releases/download/v0.2.0%5Cmovielens_100K.hdf5

While debugging the link reference, I found that the error originates from this line in _transport.py: requests.get(url, stream=True).
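
One detail in the traceback worth noting: the failing URL contains %5C, which is a percent-encoded backslash. A plausible cause, stated here as an assumption rather than something confirmed from the Spotlight source, is that the URL was assembled with a Windows-style path join. The sketch below uses the standard-library `ntpath` and `posixpath` modules to show the difference on any platform.

```python
import ntpath
import posixpath
from urllib.parse import quote

base = "https://github.com/maciejkula/recommender_datasets/releases/download/v0.2.0"
filename = "movielens_100K.hdf5"

# Windows path semantics insert a backslash between the two parts.
windows_joined = ntpath.join(base, filename)

# POSIX semantics (or plain string concatenation with "/") keep a forward slash.
portable = posixpath.join(base, filename)

# Percent-encoding the backslash yields the %5C seen in the 404 message.
encoded = quote(windows_joined, safe=":/")
```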

The URL passed appears to be correct, but somehow requests.get comes back with a 404. So I downloaded the file manually and placed it in the destination folder, and then I came across the error below, which is beyond my comprehension :)

C:\ProgramData\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
<Interactions dataset (944 users x 1683 items x 100000 interactions)>
Traceback (most recent call last):
  File "C:/BNPP/recoEngine/example.py", line 339, in <module>
    random_state=random_state)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\cross_validation.py", line 146, in user_based_train_test_split
    seed = random_state.randint(minint, maxint)
  File "mtrand.pyx", line 991, in mtrand.RandomState.randint
ValueError: high is out of bounds for int32

Would appreciate your help in smoothly testing out the examples in spotlight.

Thanks,
Pavan
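
Regarding the "high is out of bounds for int32" error: the traceback shows randint being called with minint and maxint bounds. If those bounds are wider than 32 bits and the platform's default NumPy integer is 32-bit (as it was on Windows with NumPy versions of that time), the call fails. That reading is an assumption; a sketch of drawing a seed that stays within int32 on every platform:

```python
import numpy as np

random_state = np.random.RandomState(42)

# Take the bounds from int32 explicitly so they are valid everywhere.
int32_info = np.iinfo(np.int32)
seed = random_state.randint(int32_info.min, int32_info.max)
```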

Conda can't find the spotlight module anymore

I installed the spotlight package using
conda install -c maciejkula -c soumith spotlight=0.1.2

as mentioned in the GitHub link. Then I tested a sample program, which worked fine. Just 5 minutes later, I ran the code again and got this error:
ImportError: No module named 'spotlight.datasets'; 'spotlight' is not a package

I am dumbfounded as nothing changed. Any thoughts?

question about context_feature in the hybrid branch

I checked out your hybrid_models branch and noticed you are implementing a context_feature in addition to user_feature and item_feature to handle metadata. Can you please elaborate on what you have in mind for "context_feature"?
Also, from your hybrid network, it looks like the metadata embedding is added to the id embedding. Based on your experience, have you tried other ways to combine these embeddings (e.g. concatenate(user_id_emb, user_feature).dot(concatenate(item_id_emb, item_feature)) + bias_terms)?

        if self.user is not None:
            user_representation += self.user(user_features)
        if self.context is not None:
            user_representation += self.context(context_features)
        if self.item is not None:
            item_representation += self.item(item_features)

Thanks

Custom dataset, negative sample and metrics

Hi @maciejkula,

Thanks for your awesome work!
I am now using it for my own research and I found there are some small issues:

I think right now the project is not so friendly to custom datasets, as sometimes user ids do not start from 0, or are even unique strings like "AMEVO2LY6VEJA". So I recommend using a dictionary to convert unique user identifiers to ints. It would be nice if this project had a built-in data converter.
For negative sampling here, it cannot be guaranteed that the sampled item is truly "negative". In other words, it can be positive, though the probability is relatively low.
It would be good if spotlight had more ranking metrics like "Precision", "Recall", "AUC" and "MAP".
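
The dictionary-based conversion from the first point can be sketched in a few lines. `build_id_mapping` is a hypothetical helper, and apart from "AMEVO2LY6VEJA" the identifiers below are made up.

```python
def build_id_mapping(raw_ids):
    """Map each distinct raw identifier to a contiguous integer, starting at 0."""
    mapping = {}
    for raw in raw_ids:
        # Assign the next free integer the first time an id is seen.
        mapping.setdefault(raw, len(mapping))
    return mapping


raw_user_ids = ["AMEVO2LY6VEJA", "B00X4WHP5E", "AMEVO2LY6VEJA", "C3PO"]
user_mapping = build_id_mapping(raw_user_ids)
encoded_user_ids = [user_mapping[raw] for raw in raw_user_ids]
```

Keeping `user_mapping` around makes it possible to translate predictions back to the original identifiers later.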

Run evaluation in (mini-)batches

Hi, do you think it would be worth trying to transform the current evaluation code to compute scores in mini-batches, instead of one user at a time as it is done currently?

Note that this seems to be blocked by #92

Save Model

Hey,

I am using Spotlight for a school project. Is there any way I can train the model on some data and save it, and use it later for prediction?

Thanks!

Add support for contextual data input

Hi @maciejkula !

Great library! This is some seriously awesome work! 👍

I was wondering how to add additional features that are not stationary, such as context data that applies to the whole sequence.

Can you please tell me how to do this with the Spotlight library, or is this not supported yet?

Thanks!

user_based_train_test_split broken?

Currently trying with latest master branch to downsample the goodreads dataset by user:

import numpy as np

from spotlight.cross_validation import user_based_train_test_split
from spotlight.datasets.goodbooks import get_goodbooks_dataset

dataset = get_goodbooks_dataset()

# Cast ratings to 32-bit floats before splitting.
dataset.ratings = dataset.ratings.astype(np.float32)
# Call the imported function directly; the spotlight package itself is never imported.
train, test = user_based_train_test_split(dataset, test_percentage=0.7)
print(train)

Train is: <Interactions dataset (53425 users x 10001 items x 1793373 interactions)>
The full dataset is: <Interactions dataset (53425 users x 10001 items x 5976479 interactions)>

i.e. the split only downsampled interactions, not users.
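One possible reading of the output above, offered as an assumption rather than something the report confirms, is that the printed user count is the nominal size of the id space and not the number of distinct users left in the split. The toy sketch below (plain Python, not Spotlight's implementation, with hypothetical helper names and data) shows a user-based split and how to count the users that actually appear on each side.

```python
import random


def user_based_split(interactions, test_fraction, seed=0):
    """Assign every user, with all of their interactions, to train or test."""
    users = sorted({user for user, _ in interactions})
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = int(round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [(u, i) for u, i in interactions if u not in test_users]
    test = [(u, i) for u, i in interactions if u in test_users]
    return train, test


def distinct_users(interactions):
    """Number of users that actually appear in a set of interactions."""
    return len({user for user, _ in interactions})


# Hypothetical toy data: 10 users with 3 interactions each.
interactions = [(user, item) for user in range(10) for item in range(3)]
train, test = user_based_split(interactions, test_fraction=0.7)

train_user_count = distinct_users(train)
test_user_count = distinct_users(test)
```

With a real Interactions object, the analogous check would presumably be to count the unique values in its user id array rather than rely on the printed dimensions.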

PackageNotFoundError: conda install on Windows

Thanks, guys, for developing (and open-sourcing) a PyTorch-based recommender engine. We are interested in using it in one of our projects.

While trying to install the package using conda on Windows, I am getting PackageNotFoundError: Packages missing in current channels. It seems that the selected conda channels (maciejkula, pytorch) do not have the required packages (spotlight 0.1.4).

Has the channel changed? Do you support a pip-based install? If so, could you add pip-based installation steps to the GitHub landing page (README.md)?

Full trace:

(py3.6) C:\Users\Abhijeets>conda install -c maciejkula -c pytorch spotlight=0.1.4
Fetching package metadata .................
Solving package specifications:

PackageNotFoundError: Packages missing in current channels:

  - spotlight 0.1.4* -> pytorch 0.3.1.*

We have searched for the packages in the following channels:

  - https://conda.anaconda.org/maciejkula/win-64
  - https://conda.anaconda.org/maciejkula/noarch
  - https://conda.anaconda.org/pytorch/win-64
  - https://conda.anaconda.org/pytorch/noarch
  - https://repo.continuum.io/pkgs/main/win-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/win-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/win-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/win-64
  - https://repo.continuum.io/pkgs/pro/noarch
  - https://repo.continuum.io/pkgs/msys2/win-64
  - https://repo.continuum.io/pkgs/msys2/noarch

releases/download not available

Hi,

I manually downloaded your stable 0.1.1 but with the example code from here I get this error.

HTTPError: 404 Client Error: Not Found for url: https://github.com/maciejkula/recommender_datasets/releases/download/0.1.0%5Cmovielens_100K.hdf5

Seems that the whole "releases" folder isn't present anymore.

Thanks for your help !
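Note that %5C in the failing URL is a percent-encoded backslash. One plausible cause, assumed here and not confirmed in the report, is that the URL was assembled with a platform path join on Windows. The sketch below reproduces that effect with the standard library on any platform and shows a portable alternative; the prefix and file name are copied from the error message.

```python
import ntpath
import posixpath
from urllib.parse import quote

# Prefix and file name taken from the URL in the error message.
URL_PREFIX = "https://github.com/maciejkula/recommender_datasets/releases/download/0.1.0"
FILENAME = "movielens_100K.hdf5"

# ntpath is the Windows flavour of os.path; it joins with a backslash.
windows_style = ntpath.join(URL_PREFIX, FILENAME)

# Percent-encoding the backslash gives the %5C seen in the 404 URL.
encoded = quote(windows_style, safe=":/")

# posixpath always joins with a forward slash, whatever the platform.
portable = posixpath.join(URL_PREFIX, FILENAME)
```

If this is indeed the cause, the releases themselves may still exist and only the separator in the generated URL is wrong.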

Expected or unexpected behavior regarding model fit method

I am trying Spotlight and noticed two things; I am not sure whether they are expected behavior.

Use case:
A simplistic example is as follows:

user_ids : [1,2,3..............................]
item_ids = [1,2,3,......................]

interactions can look like this (userid, itemid):
(1,5), (2,3), (1,1), (9,2), etc etc

Initially, I train the model on some historical data of user-item interactions. After that, each time a user interacts with an item, I update the model by calling the fit method.

Case 1: Calling fit repeatedly for ImplicitSequenceModel

model.fit(sequences)
Here I get the following error:

Traceback (most recent call last):
  File "spotlight_rec.py", line 110, in <module>
    main()
  File "spotlight_rec.py", line 80, in main
    rt_counter = recommender.train(train_set)
  File "spotlight_rec.py", line 38, in train
    self.model.fit(self.sequences)
  File "/opt/conda/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 215, in fit
    self._check_input(sequences)
  File "/opt/conda/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 192, in _check_input
    raise ValueError('Maximum item id greater '
ValueError: Maximum item id greater than number of items in model.

The error happens exactly at this point:

def _check_input(self, item_ids):

        if isinstance(item_ids, int):
            item_id_max = item_ids
        else:
            item_id_max = item_ids.max()

        if item_id_max >= self._num_items:
            raise ValueError('Maximum item id greater '
                             'than number of items in model.')

This condition (if item_id_max >= self._num_items) can be true for us because a user could interact with an item whose id is higher than the number of items.

Our current workaround: reinitialize the model when we encounter this error.
Main question: Is this expected behavior, and how can we handle this situation?

Case 2: Calling fit repeatedly or just one time for ImplicitFactorizationModel
In both these cases, I get an error when len(item_ids) % minibatch size == 1. For example, if len(item_ids) is 17 and the minibatch size is 8, then 17 % 8 == 1, leading to an error. In my case, the minibatch size is 128. It happens at the last iteration, when creating the user and item embeddings in the forward method of the BilinearNet class.

Value Error:

Traceback (most recent call last):
  File "src/spotlight_recommender.py", line 136, in <module>
    main()
  File "src/spotlight_recommender.py", line 114, in main
    recommender.train(train_set)
  File "src/spotlight_recommender.py", line 38, in train
    self.model.fit(data)
  File "/home/user/anaconda3/lib/python3.5/site-packages/spotlight/factorization/implicit.py", line 233, in fit
    positive_prediction = self._net(user_var, item_var)
  File "/home/user/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda3/lib/python3.5/site-packages/spotlight/factorization/representations.py", line 92, in forward
    dot = (user_embedding * item_embedding).sum(1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

The error occurs in the following code:

def forward(self, user_ids, item_ids):
        """
        Compute the forward pass of the representation.
        Parameters
        ----------
        user_ids: tensor
            Tensor of user indices.
        item_ids: tensor
            Tensor of item indices.
        Returns
        -------
        predictions: tensor
            Tensor of predictions.
        """

        user_embedding = self.user_embeddings(user_ids)
        item_embedding = self.item_embeddings(item_ids)

        user_embedding = user_embedding.squeeze()
        item_embedding = item_embedding.squeeze()

        user_bias = self.user_biases(user_ids).squeeze()
        item_bias = self.item_biases(item_ids).squeeze()

        dot = (user_embedding * item_embedding).sum(1)

        return dot + user_bias + item_bias

The error occurs when len(item_ids) is 1. It works for all other values. After some logging, the batch sizes go like this: 128, 128, 128, ..., 1, and the last one causes the error. When len(item_ids) is 1, I also notice that item_embedding looks different.

Current workaround: we just skip this iteration and retrain on it in the next.
Main question: Are we doing something wrong, or is this expected behavior? Why can't this situation be handled within this method?
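The traceback is consistent with the argument-free squeeze() calls in forward removing the batch dimension when the batch has a single row. The sketch below illustrates the effect with NumPy arrays as a stand-in for the tensors, on the assumption that squeeze() without an axis behaves the same way in both libraries; the shapes follow the report (embedding dimension 32, batch size 128).

```python
import numpy as np

embedding_dim = 32
rng = np.random.RandomState(0)

full_batch = rng.randn(128, embedding_dim)   # regular batch of 128 embeddings
last_batch = rng.randn(1, embedding_dim)     # leftover batch of size 1

# squeeze() with no axis drops *every* length-1 dimension.
squeezed_full = full_batch.squeeze()         # shape stays (128, 32)
squeezed_last = last_batch.squeeze()         # shape becomes (32,)

# Summing over axis 1 needs a 2-D array, so the squeezed size-1 batch fails.
try:
    squeezed_last.sum(1)
    failed = False
except Exception:                            # NumPy raises an axis error here
    failed = True

# Keeping the batch dimension (no squeeze) works for any batch size.
dot_last = (last_batch * last_batch).sum(1)  # shape (1,)
```

Under that assumption, skipping the squeeze on the embeddings, or squeezing only a specific dimension, would avoid the special case for a trailing batch of one.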

Extra:
Normal item embedding when the size is 128:

[torch.FloatTensor of size 128x32]
 <class 'torch.autograd.variable.Variable'>
Variable containing:
-0.0585  0.1001  0.0464  ...   0.0846  0.1175 -0.1926
-0.0459  0.0973 -0.1781  ...   0.0870  0.0962 -0.1985
 0.1101  0.0706 -0.0717  ...   0.1948 -0.0446 -0.2401
          ...             ⋱             ...          
 0.1779  0.1666  0.1758  ...   0.1664  0.0789 -0.0398
 0.2808  0.2151  0.0562  ...   0.0679  0.0388  0.0283
-0.6652  0.6864 -0.8043  ...  -0.6818 -0.3306  0.4183

Item embedding in the last round when the size is 1:

[torch.FloatTensor of size 128x32]
 <class 'torch.autograd.variable.Variable'>
Variable containing:
 0.4371
 0.5031
-0.1965
 0.5416
 0.2742
-0.4299
 0.4275
 0.5921
-0.5964
-0.3495
-0.3941
 0.4900
 0.5652
-0.4823
 0.4789
 0.4647
-0.4036
 0.3759
-0.4495
 0.1618
 0.4748
-0.5056
-0.6162
-0.5277
-0.4750
 0.4905
-0.3921
 0.4476
 0.4825
-0.4493
 0.5846
-0.5344

ModuleNotFoundError

When I try to import some modules:

from spotlight.datasets.movielens import get_movielens_dataset
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'spotlight.datasets'

from spotlight.factorization.explicit import ExplicitFactorizationModel
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'spotlight.factorization'
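One thing worth checking in a situation like this is which file Python actually resolves for a module name, since a different package or a local file called spotlight could shadow the library (an assumption; the report does not say whether this is the case). The sketch below uses only the standard library; `describe_module` is a hypothetical helper, and `json` merely stands in for a package that is known to be installed.

```python
import importlib.util


def describe_module(name):
    """Return the file a module would be imported from, or None if not importable."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package is missing or is not a package at all.
        return None
    if spec is None:
        return None
    return spec.origin


# A submodule that exists, one that does not, and one under a missing package.
found = describe_module("json.decoder")
missing_submodule = describe_module("json.no_such_submodule")
missing_package = describe_module("no_such_pkg_xyz.sub")
```

Calling `describe_module("spotlight")` in the affected environment would show whether the name points at the expected installation directory.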

Incorporate user/item metadata

Hi,
Is there a way to incorporate user metadata (e.g, age, location) and item (e.g, tag) metadata in spotlight?

load own dataset

Hi, I have a dataset in CSV format (https://raw.githubusercontent.com/juanremi/datasets-to-CF-recsys/master/bigquery-frutas-sin-items-no-vistos-ids.csv); the first row contains the headers (user, item, rating).

So I modified the file https://raw.githubusercontent.com/maciejkula/spotlight/master/spotlight/datasets/movielens.py to load my own dataset:

import os
import torch
import numpy as np

from spotlight.datasets import _transport
from spotlight.interactions import Interactions

def _get_my_own(dataset):
    extension = '.csv'
    URL_PREFIX = '/home/user/htdocs/pruebas/datasets/myheaders/'

    data = np.genfromtxt(URL_PREFIX + dataset + extension, delimiter=',', names=True, dtype=(int, int, float))
    usuarios = data['user']
    items = data['item']
    ratings = data['rating']
    return (usuarios, items, ratings)

def get_my_own_dataset(myfile):
    """
    Returns
    -------

    Interactions: :class:`spotlight.interactions.Interactions`
        instance of the interactions class
    """

    url = myfile

    return Interactions(*_get_my_own(url))

dataset = get_my_own_dataset('bigquery-frutas-sin-items-no-vistos-ids')
print(dataset)

from spotlight.factorization.explicit import ExplicitFactorizationModel

model = ExplicitFactorizationModel(loss='regression',
                                   embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   l2=1e-9,  # strength of L2 regularization
                                   learning_rate=1e-3,
                                   use_cuda=torch.cuda.is_available())

from spotlight.cross_validation import random_train_test_split

train, test = random_train_test_split(dataset, random_state=np.random.RandomState(42))

print('Split into \n {} and \n {}.'.format(train, test))

model.fit(train, verbose=True)

But when I run it, the output is:

<Interactions dataset (14 users x 17 items x 139 interactions)>
Split into 
 <Interactions dataset (14 users x 17 items x 111 interactions)> and 
 <Interactions dataset (14 users x 17 items x 28 interactions)>.
Traceback (most recent call last):
  File "condadata.py", line 52, in <module>
    model.fit(train, verbose=True)
  File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/spotlight/factorization/explicit.py", line 172, in fit
    loss = loss_fnc(ratings_var, predictions)
  File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/spotlight/losses.py", line 191, in regression_loss
    return ((observed_ratings - predicted_ratings) ** 2).mean()
  File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/variable.py", line 752, in __sub__
    return self.sub(other)
  File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/variable.py", line 296, in sub
    return self._sub(other, False)
  File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/variable.py", line 290, in _sub
    return Sub(inplace=inplace)(self, other)
  File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/_functions/basic_ops.py", line 34, in forward
    return a.sub(b)
TypeError: sub received an invalid combination of arguments - got (torch.FloatTensor), but expected one of:
 * (float value)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleTensor other)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (float value, torch.DoubleTensor other)

Any idea?
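The TypeError mentions a FloatTensor being combined with a DoubleTensor, which suggests the ratings were loaded as 64-bit floats while the predictions are 32-bit. A minimal sketch of the dtype difference, using a made-up in-memory CSV with the same header as in the report and only NumPy:

```python
import io

import numpy as np

# Hypothetical CSV content with the same header as in the report.
csv_text = "user,item,rating\n0,1,4.0\n1,2,3.5\n2,0,5.0\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=(int, int, float))

# Python's float maps to 64-bit floats, the NumPy counterpart of a DoubleTensor.
ratings_float64 = data["rating"]

# Cast explicitly to 32-bit floats before building the Interactions object.
ratings_float32 = data["rating"].astype(np.float32)
```

In the loader above, this would presumably amount to returning `ratings.astype(np.float32)` from `_get_my_own`; whether the id arrays also need a specific integer dtype is not something the traceback shows.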

Unit tests on Windows

When running the test suite of the master branch on Windows 10, multiple tests currently fail:

pytest -sv tests/
[...]
=============== 34 failed, 24 passed, 3 error in 24.89 seconds ================

Full output can be found here.

The two most frequent errors are:

ValueError in randint

 ..\spotlight\cross_validation.py:146: in user_based_train_test_split
     seed = random_state.randint(minint, maxint)
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

 >   ???
 E   ValueError: high is out of bounds for int32

 mtrand.pyx:975: ValueError

and a dtype mismatch in torch.index_select:

    def _get_hashed_indices(self, original_indices):
        # [...]
        hashed_indices = torch.index_select(self._hashes,
                                            0,
>                                           original_indices.squeeze())
E       TypeError: torch.index_select received an invalid combination of arguments - got (torch.LongTensor, int, !torch.IntTensor!), but expected (torch.LongTensor source, int dim, torch.LongTensor index)

..\spotlight\layers.py:204: TypeError
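Both errors look like symptoms of the default integer width differing between platforms. As a hedged sketch (the bounds below are hypothetical and not taken from cross_validation.py), requesting 64-bit integers explicitly makes the NumPy side behave the same on Windows and Linux:

```python
import numpy as np

random_state = np.random.RandomState(42)

# Hypothetical seed bounds that do not fit in a signed 32-bit integer.
low, high = 0, 2**32 - 1

# On the NumPy version in this report, randint without dtype follows the
# platform's C long, which is 32-bit on Windows and rejects this upper bound.
# An explicit 64-bit dtype accepts it on every platform.
seed = random_state.randint(low, high, dtype=np.int64)

# Index arrays can be pinned to 64 bits in the same way before they are
# converted to tensors, so they end up as 64-bit (Long) index tensors.
indices = np.asarray([3, 1, 2], dtype=np.int64)
```

Whether this matches how the maintainers would want to fix the two test failures is an open question; it only shows the dtype mechanics.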

Run on the master branch with the following installed dependencies:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import torch; print("PyTorch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
import spotlight; print("spotlight", spotlight.__version__)

Windows-10-10.0.14393-SP0
Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 15:10:56) [MSC v.1900 64 bit (AMD64)]
NumPy 1.12.1
SciPy 1.0.0
Scikit-Learn 0.19.1
PyTorch 0.3.0b0+591e73e
CUDA available: True
CUDA version: 8.0
spotlight v0.1.3

PyTorch for Windows was installed from the peterjc123 conda channel, as described in pytorch/pytorch#494 (comment)

When installed on Linux with the same approach (though without GPU support), all tests pass.

item-item similarity

I'm trying to implement both recommendations and item-item similarities based on an ImplicitFactorizationModel trained on clickstream data. The former is trivial using the predict method, but the latter I'm not quite sure about. Looking at the code, I could obtain the item embeddings in a similar way to how it's done in _components.py and representations.py, and then apply a similarity measure to the resulting torch tensors.

Is this a sensible approach / Is there a better way to do this / Is it even possible?
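A minimal sketch of the similarity step described above, assuming the item embeddings have already been extracted into a NumPy array of shape (num_items, embedding_dim), for example from the network's item_embeddings layer that appears in representations.py. The function name and the toy matrix are made up, and cosine similarity is just one possible measure.

```python
import numpy as np


def most_similar_items(item_embeddings, item_id, k=5):
    """Return the k items with the highest cosine similarity to item_id."""
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    normalized = item_embeddings / np.maximum(norms, 1e-12)  # avoid division by zero
    similarities = normalized @ normalized[item_id]
    similarities[item_id] = -np.inf                          # exclude the query item
    top = np.argsort(-similarities)[:k]
    return [(int(i), float(similarities[i])) for i in top]


# Hypothetical embedding matrix: 4 items, 3 latent dimensions.
embeddings = np.array([[1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0],
                       [0.0, 1.0, 0.0],
                       [-1.0, 0.0, 0.0]])

neighbours = most_similar_items(embeddings, item_id=0, k=2)
```

Normalizing first means that a single matrix-vector product yields all cosine similarities at once, which keeps the lookup cheap for moderately sized catalogues.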

Tensorflow Implementation

Is anyone interested in having the same feature set available in TF?

I'm considering implementing at least implicit matrix factorization in TF, want to know if it's worth making PRs or not.

Conda Install Issue (Windows 10)

Hi! Thank you for Spotlight.

I have a problem similar to issue #80.

My OS is Windows 10 x64.

First, I must admit I am new to Anaconda. I have a 'portable' installation of Anaconda (from Anaconda3-5.1.0-Windows-x86_64.exe).

When I try:

conda install -c maciejkula -c pytorch spotlight=0.1.4

I get:

u:\Python\Anaconda3>conda install -c maciejkula -c pytorch spotlight=0.1.4
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - spotlight=0.1.4
  - pytorch=0.3.1

Current channels:

  - https://conda.anaconda.org/maciejkula/win-64
  - https://conda.anaconda.org/maciejkula/noarch
  - https://conda.anaconda.org/pytorch/win-64
  - https://conda.anaconda.org/pytorch/noarch
  - https://repo.continuum.io/pkgs/main/win-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/win-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/win-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/win-64
  - https://repo.continuum.io/pkgs/pro/noarch
  - https://repo.continuum.io/pkgs/msys2/win-64
  - https://repo.continuum.io/pkgs/msys2/noarch

u:\Python\Anaconda3>

I have not created an environment... Is it necessary?

Or is it perhaps necessary to add channels?

Install Spotlight on win 64 using conda

I have created a py 3.6 env.

Please let me know how to install it.

conda install -c maciejkula -c pytorch spotlight=0.1.4 is not working.

Even conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.4 fails.
