maciejkula / spotlight
Deep recommender models using PyTorch.
License: MIT License
I ran this command:
conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.4
but it showed this error:
conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.3
Fetching package metadata ...........
Solving package specifications:
PackageNotFoundError: Packages missing in current channels:
I ran into a very odd segmentation fault error. This could very well be a PyTorch bug, but I thought I'd bring it up here, first. I've produced a minimal example at the bottom of this issue.
So far, I know that the fault happens at the loss.backward() call in model.fit(). The fault only seems to happen under the combination of two conditions (that I can find, so far):
sparse=True.
(BTW, I pass in an SGD optimizer because that seems to be the only one that works right now with sparse embeddings.)
I'm using pandas version 0.20.3 from conda, the latest spotlight from master, and PyTorch 0.2.0 from conda. I'd love to know if others can reproduce this.
As I said, this could very well be a PyTorch bug, but, if others run into this, it'll be helpful to have this issue as a reference.
import pandas as pd
import numpy as np
import torch
from spotlight.interactions import Interactions
from spotlight.factorization.implicit import ImplicitFactorizationModel
user_ids = [2471, 5808, 3281, 4086, 6293, 8970, 11828, 3281]
item_ids = [1583, 57, 6963, 867, 8099, 10991, 24, 800]
num_users = 15274
num_items = 25655
train = Interactions(np.array(user_ids, dtype=np.int64),
                     np.array(item_ids, dtype=np.int64),
                     num_users=num_users,
                     num_items=num_items)
def optimizer_func(params, lr=0.01):
    return torch.optim.SGD(params, lr=lr)
RANDOM_STATE = np.random.RandomState(42)
model = ImplicitFactorizationModel(loss='bpr',
                                   embedding_dim=32,
                                   batch_size=4,
                                   n_iter=1,
                                   use_cuda=False,
                                   optimizer_func=optimizer_func,
                                   sparse=True,
                                   random_state=RANDOM_STATE)
# Fault
model.fit(train, verbose=True)
Hi @maciejkula,
Is there a LightFM sample_weight equivalent in Spotlight?
The scenario is there are two types of ratings in the problem:
I believe this is good cause to utilize sample_weights, but please correct me if I'm wrong.
For folks stumbling upon this, sample weight in LightFM is defined as:
sample_weight: np.float32 coo_matrix of shape [n_users, n_items], optional
matrix with entries expressing weights of individual
interactions from the interactions matrix.
Its row and col arrays must be the same as
those of the interactions matrix. For memory
efficiency its possible to use the same arrays
for both weights and interaction matrices.
Defaults to weight 1.0 for all interactions.
Not implemented for the k-OS loss.
Searching Spotlight didn't reveal interaction weights used anywhere except in cross_validation.py like:
weights=_index_or_none(interactions.weights...
Whereas I was hoping to see something more like LightFM's _lightfm_fast.pyx.template:
loss = weight * (prediction - y)
Is there a reason this is missing?
How hard would it be to add to the ImplicitFactorization and ImplicitSequence models?
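For readers wondering what such a weighting would look like, here is a minimal sketch of a per-interaction weighted loss. It is a generic logistic pointwise loss written with NumPy, not Spotlight's or LightFM's implementation; the function name weighted_pointwise_loss and its arguments are assumptions made for illustration.

```python
import numpy as np


def weighted_pointwise_loss(positive_scores, negative_scores, weights=None):
    """Logistic pointwise loss with optional per-interaction weights.

    Hypothetical helper for illustration only; not part of Spotlight.
    """
    positive_scores = np.asarray(positive_scores, dtype=np.float64)
    negative_scores = np.asarray(negative_scores, dtype=np.float64)
    # -log(sigmoid(pos)) - log(1 - sigmoid(neg)), written with logaddexp
    # so that large scores do not overflow.
    per_example = (np.logaddexp(0.0, -positive_scores)
                   + np.logaddexp(0.0, negative_scores))
    if weights is None:
        return per_example.mean()
    weights = np.asarray(weights, dtype=np.float64)
    # Weighted average: an interaction with weight 2 counts twice as much.
    return (weights * per_example).sum() / weights.sum()
```

With all weights equal to 1.0 this reduces to the unweighted mean, which matches the "defaults to weight 1.0" behaviour quoted from LightFM above.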
If I'm not mistaken, the MRR score is the mean of reciprocal rank scores over a set of examples, where the reciprocal rank is given as 1 / (rank of the first positive item).
The current implementation in spotlight takes a mean over the reciprocal ranks of all positive items. I think this is not correct. It is also inconsistent with lightfm. It also makes it impossible to get an MRR score of 1 even with a perfect model (even if the positive items are ranked [1, 2, 3], taking the mean gives a result lower than 1).
Can I send a PR to fix this?
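To make the difference concrete, here is a small pure-Python sketch of the two definitions. The function names are made up for this illustration and are not Spotlight or LightFM APIs; ranks are 1-based.

```python
def reciprocal_rank(ranks_of_positives):
    """1 / rank of the best-ranked positive item."""
    return 1.0 / min(ranks_of_positives)


def mean_of_reciprocal_ranks(ranks_of_positives):
    """Mean of 1 / rank taken over all positive items."""
    return sum(1.0 / r for r in ranks_of_positives) / len(ranks_of_positives)


def mrr(per_user_ranks):
    """Mean reciprocal rank over users, using the first positive item only."""
    return sum(reciprocal_rank(r) for r in per_user_ranks) / len(per_user_ranks)


perfect = [1, 2, 3]                       # a perfect ranking of three positives
print(reciprocal_rank(perfect))           # 1.0
print(mean_of_reciprocal_ranks(perfect))  # about 0.611, so never 1 here
```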
As far as I understand, the model can predict a sequence of items for a given item or items in a sequence, but how do I get personalized predictions for a given user?
Is it possible to extract the learned user embeddings or item embeddings from the model which can be used to generate a predicted weight by doing a dot product?
I am on the master branch of pytorch and am also using the master branch of spotlight.
Test code:
from spotlight.cross_validation import random_train_test_split
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.evaluation import mrr_score
from spotlight.factorization.implicit import ImplicitFactorizationModel
dataset = get_movielens_dataset(variant='1M')
train, test = random_train_test_split(dataset,test_percentage=0.01)
print train
print test
model = ImplicitFactorizationModel(n_iter=1,embedding_dim=32,
loss='bpr', use_cuda=True)
model.fit(train,verbose=True)
mrr = mrr_score(model, test)
print mrr.mean()
Put a breakpoint at mrr_score
def mrr_score(model, test, train=None):
    """
    Compute mean reciprocal rank (MRR) scores. One score
    is given for every user with interactions in the test
    set, representing the mean reciprocal rank of all their
    test items.

    Parameters
    ----------
    model: fitted instance of a recommender model
        The model to evaluate.
    test: :class:`spotlight.interactions.Interactions`
        Test interactions.
    train: :class:`spotlight.interactions.Interactions`, optional
        Train interactions. If supplied, scores of known
        interactions will be set to very low values and so not
        affect the MRR.

    Returns
    -------
    mrr scores: numpy array of shape (num_users,)
        Array of MRR scores for each user in test.
    """
    test = test.tocsr()
    if train is not None:
        train = train.tocsr()
    mrrs = []
    for user_id, row in enumerate(test):
        if not len(row.indices):
            continue
        predictions = -model.predict(user_id)
        import pdb
        pdb.set_trace()
        if train is not None:
            predictions[train[user_id].indices] = FLOAT_MAX
        mrr = (1.0 / st.rankdata(predictions)[row.indices]).mean()
        mrrs.append(mrr)
    return np.array(mrrs)
Let's run and debug
python test_spotlight.py
<Interactions dataset (6041 users x 3953 items x 990206 interactions)>
<Interactions dataset (6041 users x 3953 items x 10003 interactions)>
Epoch 0: loss 0.136451368182
> /home/sunlin/anaconda2/lib/python2.7/site-packages/spotlight/evaluation.py(48)mrr_score()
-> if train is not None:
(Pdb) predictions.shape
(15626209,)
(Pdb)
The correct shape is (3953,), but the actual shape is (15626209,), i.e. 3953 * 3953.
The root cause is in the BilinearNet model. A fix is to change
return dot + user_bias + item_bias
to
return dot.view(-1, 1) + user_bias + item_bias
Is this to do with my version of pytorch?
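A 3953 * 3953 result is what broadcasting produces when a 1-D vector is added to a column vector. The sketch below reproduces the effect with NumPy arrays standing in for the PyTorch tensors. It assumes the bias terms have shape (n, 1), which the proposed dot.view(-1, 1) fix suggests but the thread does not confirm.

```python
import numpy as np

n = 4
dot = np.ones(n)              # shape (n,)
user_bias = np.ones((n, 1))   # shape (n, 1): assumed shape of an un-squeezed bias
item_bias = np.ones((n, 1))

# (n,) + (n, 1) broadcasts to (n, n): every dot product meets every bias.
broadcast = dot + user_bias + item_bias
print(broadcast.shape)        # (4, 4)

# Reshaping dot to a column first keeps everything at (n, 1).
fixed = dot.reshape(-1, 1) + user_bias + item_bias
print(fixed.shape)            # (4, 1)
```

Flattening the (n, n) result gives n * n entries, which matches the 15626209 = 3953 * 3953 seen in the debugger.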
Hi,
I ran
conda install --prefix=venv3 -c maciejkula -c pytorch spotlight=0.1.2
and it returned
Fetching package metadata ...............
Solving package specifications:
PackageNotFoundError: Packages missing in current channels:
- spotlight 0.1.2* -> pytorch 0.2.0
We have searched for the packages in the following channels:
- https://conda.anaconda.org/maciejkula/osx-64
- https://conda.anaconda.org/maciejkula/noarch
- https://conda.anaconda.org/pytorch/osx-64
- https://conda.anaconda.org/pytorch/noarch
- https://repo.continuum.io/pkgs/main/osx-64
- https://repo.continuum.io/pkgs/main/noarch
- https://repo.continuum.io/pkgs/free/osx-64
- https://repo.continuum.io/pkgs/free/noarch
- https://repo.continuum.io/pkgs/r/osx-64
- https://repo.continuum.io/pkgs/r/noarch
- https://repo.continuum.io/pkgs/pro/osx-64
- https://repo.continuum.io/pkgs/pro/noarch
I was able to install it a few days ago with the above command. Wonder if something changed?
I created a new project from scratch, created a new conda environment and ran
conda install -c maciejkula -c pytorch spotlight=0.1.2
and got the same error.
Do you have any plans to complete warp loss?
@maciejkula I guess we should remove the items found in the training dataset before the negative sampling. Otherwise, it might make the learning less effective?
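One simple way to do this is rejection sampling: draw a random item and redraw if the user has already interacted with it. The sketch below is pure Python and is not how Spotlight samples negatives; sample_negative and the seen_items dictionary are hypothetical names used only for illustration.

```python
import random


def sample_negative(user_id, num_items, seen_items, rng, max_tries=100):
    """Draw an item id that the user has not interacted with.

    seen_items maps a user id to the set of item ids seen in training.
    """
    seen = seen_items.get(user_id, set())
    for _ in range(max_tries):
        candidate = rng.randrange(num_items)
        if candidate not in seen:
            return candidate
    raise RuntimeError('No unseen item found; the user may have seen '
                       'almost every item.')


seen_items = {0: {0, 1, 2, 3}}
rng = random.Random(42)
negative = sample_negative(0, 6, seen_items, rng)  # always 4 or 5 here
```

The set lookup costs extra time per sample, so whether it pays off depends on how often a random draw hits a known positive.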
We'd like to be able to run tests on GPU (even if that is only manual for now).
This URL does not exist: https://github.com/maciejkula/recommender_datasets/releases/download/
It could be useful to upload spotlight to PyPI, particularly since it's a pure Python package, and besides installing the PyTorch dependency, there is no advantage in installing it from a custom conda channel.
Hi,
I was looking at LightFM and saw item and user metadata being used for recommendations. This is really cool. Just wondering if such functionality is on the roadmap for spotlight?
Hi!
Very cool project.
There are some potential improvements to the sequential model, found in Improved Recurrent Neural Networks for Session-based Recommendations.
Randomly dropping items from sequences helps to avoid over-fitting to website structure and improved MRR by ~8% in my experiments with proprietary click-stream data. If done per batch during training, the memory overhead could be avoided.
Faster recurrent units like GRU, QRNN or SRU could strike a better performance/accuracy trade-off than the causal convolution model.
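As a rough illustration of the item-dropping idea, the sketch below removes items from a single sequence at random and restores the original length with left padding. It is pure Python and makes two assumptions not stated in the post: sequences are lists of item ids, and 0 is the padding value. The name drop_items is hypothetical.

```python
import random


def drop_items(sequence, drop_prob, rng, padding=0):
    """Randomly drop items from one sequence, keeping its length fixed."""
    kept = [item for item in sequence
            if item != padding and rng.random() >= drop_prob]
    if not kept:
        # Never return an empty sequence: keep the most recent real item.
        kept = [item for item in sequence if item != padding][-1:]
    return [padding] * (len(sequence) - len(kept)) + kept


rng = random.Random(0)
augmented = drop_items([0, 0, 5, 3, 9, 2], drop_prob=0.25, rng=rng)
```

Applying this to each batch as it is drawn, rather than materialising augmented copies of the dataset, is what avoids the memory overhead mentioned above.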
Hi
I saw that "Add model serialization" is on the Trello to-do list.
If I can serialize the model, can I reload the old model and continue training with the new interactions data? But I guess there would be a learning rate problem, with the Adam optimizer at least. What do you do in practice? Can you recommend something to read?
Thank you!
Very good project. However, I ran into an issue when trying to make batch predictions with the Implicit Sequence Model.
Here I modified your example code a little bit:
import hashlib
import json
import os
import shutil
import sys
import numpy as np
from sklearn.model_selection import ParameterSampler
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.cross_validation import user_based_train_test_split
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.sequence.representations import CNNNet
from spotlight.evaluation import sequence_mrr_score
CUDA = True
NUM_SAMPLES = 10
LEARNING_RATES = [1e-3, 1e-2, 5 * 1e-2, 1e-1]
LOSSES = ['bpr', 'hinge', 'adaptive_hinge', 'pointwise']
BATCH_SIZE = [8, 16, 32, 256]
EMBEDDING_DIM = [8, 16, 32, 64, 128, 256]
N_ITER = list(range(5, 20))
L2 = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.0]
class Results:
    def __init__(self, filename):
        self._filename = filename
        open(self._filename, 'a+')

    def _hash(self, x):
        return hashlib.md5(json.dumps(x, sort_keys=True).encode('utf-8')).hexdigest()

    def save(self, hyperparams, test_mrr, validation_mrr):
        result = {'test_mrr': test_mrr,
                  'validation_mrr': validation_mrr,
                  'hash': self._hash(hyperparams)}
        result.update(hyperparams)
        with open(self._filename, 'a+') as out:
            out.write(json.dumps(result) + '\n')

    def best(self):
        results = sorted([x for x in self],
                         key=lambda x: -x['test_mrr'])
        if results:
            return results[0]
        else:
            return None

    def __getitem__(self, hyperparams):
        params_hash = self._hash(hyperparams)
        with open(self._filename, 'r+') as fle:
            for line in fle:
                datum = json.loads(line)
                if datum['hash'] == params_hash:
                    del datum['hash']
                    return datum
        raise KeyError

    def __contains__(self, x):
        try:
            self[x]
            return True
        except KeyError:
            return False

    def __iter__(self):
        with open(self._filename, 'r+') as fle:
            for line in fle:
                datum = json.loads(line)
                del datum['hash']
                yield datum
def sample_cnn_hyperparameters(random_state, num):
    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
        'kernel_width': [3, 5, 7],
        'num_layers': list(range(2, 10)),
        'dilation_multiplier': [1, 2],
        'nonlinearity': ['tanh', 'relu'],
        'residual': [True, False]
    }
    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)
    for params in sampler:
        params['dilation'] = list(params['dilation_multiplier'] ** (i % 8)
                                  for i in range(params['num_layers']))
        yield params


def sample_lstm_hyperparameters(random_state, num):
    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
    }
    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)
    for params in sampler:
        yield params


def sample_pooling_hyperparameters(random_state, num):
    space = {
        'n_iter': N_ITER,
        'batch_size': BATCH_SIZE,
        'l2': L2,
        'learning_rate': LEARNING_RATES,
        'loss': LOSSES,
        'embedding_dim': EMBEDDING_DIM,
    }
    sampler = ParameterSampler(space,
                               n_iter=num,
                               random_state=random_state)
    for params in sampler:
        yield params
def evaluate_cnn_model(hyperparameters, train, test, validation, random_state):
    h = hyperparameters
    net = CNNNet(train.num_items,
                 embedding_dim=h['embedding_dim'],
                 kernel_width=h['kernel_width'],
                 dilation=h['dilation'],
                 num_layers=h['num_layers'],
                 nonlinearity=h['nonlinearity'],
                 residual_connections=h['residual'])
    model = ImplicitSequenceModel(loss=h['loss'],
                                  representation=net,
                                  batch_size=h['batch_size'],
                                  learning_rate=h['learning_rate'],
                                  l2=h['l2'],
                                  n_iter=h['n_iter'],
                                  use_cuda=CUDA,
                                  random_state=random_state)
    model.fit(train, verbose=True)
    test_mrr = sequence_mrr_score(model, test)
    val_mrr = sequence_mrr_score(model, validation)
    return model, test_mrr, val_mrr


def evaluate_lstm_model(hyperparameters, train, test, validation, random_state):
    h = hyperparameters
    model = ImplicitSequenceModel(loss=h['loss'],
                                  representation='lstm',
                                  batch_size=h['batch_size'],
                                  learning_rate=h['learning_rate'],
                                  l2=h['l2'],
                                  n_iter=h['n_iter'],
                                  use_cuda=CUDA,
                                  random_state=random_state)
    model.fit(train, verbose=True)
    test_mrr = sequence_mrr_score(model, test)
    val_mrr = sequence_mrr_score(model, validation)
    return model, test_mrr, val_mrr


def evaluate_pooling_model(hyperparameters, train, test, validation, random_state):
    h = hyperparameters
    model = ImplicitSequenceModel(loss=h['loss'],
                                  representation='pooling',
                                  batch_size=h['batch_size'],
                                  learning_rate=h['learning_rate'],
                                  l2=h['l2'],
                                  n_iter=h['n_iter'],
                                  use_cuda=CUDA,
                                  random_state=random_state)
    model.fit(train, verbose=True)
    test_mrr = sequence_mrr_score(model, test)
    val_mrr = sequence_mrr_score(model, validation)
    return model, test_mrr, val_mrr
def run(train, test, validation, random_state, model_type):
    results = Results('{}_results.txt'.format(model_type))
    best_result = results.best()
    if model_type == 'pooling':
        eval_fnc, sample_fnc = (evaluate_pooling_model,
                                sample_pooling_hyperparameters)
    elif model_type == 'cnn':
        eval_fnc, sample_fnc = (evaluate_cnn_model,
                                sample_cnn_hyperparameters)
    elif model_type == 'lstm':
        eval_fnc, sample_fnc = (evaluate_lstm_model,
                                sample_lstm_hyperparameters)
    else:
        raise ValueError('Unknown model type')
    if best_result is not None:
        print('Best {} result: {}'.format(model_type, results.best()))
    trained_models = []
    for hyperparameters in sample_fnc(random_state, NUM_SAMPLES):
        if hyperparameters in results:
            continue
        print('Evaluating {}'.format(hyperparameters))
        (_model, test_mrr, val_mrr) = eval_fnc(hyperparameters,
                                               train,
                                               test,
                                               validation,
                                               random_state)
        trained_models.append(_model)
        print('Test MRR {} val MRR {}'.format(
            test_mrr.mean(), val_mrr.mean()
        ))
        results.save(hyperparameters, test_mrr.mean(), val_mrr.mean())
    return results, trained_models
Test training models:
import torch
torch.cuda.empty_cache()
max_sequence_length = 200
min_sequence_length = 20
step_size = 200
random_state = np.random.RandomState(100)
dataset = get_movielens_dataset('1M')
train, rest = user_based_train_test_split(dataset,
                                          random_state=random_state)
test, validation = user_based_train_test_split(rest,
                                               test_percentage=0.5,
                                               random_state=random_state)
train = train.to_sequence(max_sequence_length=max_sequence_length,
                          min_sequence_length=min_sequence_length,
                          step_size=step_size)
test = test.to_sequence(max_sequence_length=max_sequence_length,
                        min_sequence_length=min_sequence_length,
                        step_size=step_size)
validation = validation.to_sequence(max_sequence_length=max_sequence_length,
                                    min_sequence_length=min_sequence_length,
                                    step_size=step_size)
mode = 'lstm'
_results, _trained_models = run(train, test, validation, random_state, mode)
I used this for batch prediction:
_train_seq = _train.to_sequence(max_sequence_length=max_sequence_length,
                                min_sequence_length=min_sequence_length,
                                step_size=step_size)
However, here I saw negative prediction results:
Using 0 model for prediction...
[ 0. 0.00385581 0.02080305 ..., 0.00957463 -0.01620084
-0.00543251]
Using 1 model for prediction...
[ 0. 4.48851442 2.41166139 ..., -3.30704689 -3.22007656
-3.52712202]
Using 2 model for prediction...
[ 0. 0.16253436 -0.84116733 ..., -0.94019377 -1.20338225
-1.54141724]
Using 3 model for prediction...
[ 0. 1.70468175 -0.78328896 ..., -1.90896392 -0.87334442
-0.26563033]
Using 4 model for prediction...
[ 0. 1.87787497 0.42591834 ..., -0.74543238 -0.91004479
-1.74974561]
Using 5 model for prediction...
[ 0. -0.14088905 -1.42516923 ..., -0.74076736 -2.24554992
-2.00794005]
Using 6 model for prediction...
[ 0. 10.63965607 3.06833196 ..., -8.56642342 -8.88048935
-9.23009682]
Using 7 model for prediction...
[ 0. 9.38321972 -2.44772935 ..., -7.41398525 -12.05816555
-3.69266152]
Using 8 model for prediction...
[ 0. 0.26318601 0.11055376 ..., -0.07141581 -0.07582763
-0.14792749]
Using 9 model for prediction...
[ 0. -17.04875755 5.70450497 ..., -41.66604614 -40.23033524
-34.51607895]
Is there anything I did wrong for prediction? Not sure why I saw negative predicted values.
I'm playing with implicit models using the default BilinearNet as representation.
Given interactions test and some model model, one would expect that
model.predict(test.user_ids)
will work, but it raises
RuntimeError: The expanded size of the tensor (<num_users>) must match the existing size (<...>) at non-singleton dimension 0
I think fixing this would require changing the way spotlight generates predictions. Currently, when we want predictions for user 7 and items [1, 2, 3], we actually call
model._net([7, 7, 7], [1, 2, 3])
To scale this to multiple users, e.g. users [7, 8], we could:
1. pass [[7, 7, 7], [8, 8, 8]] for user_ids, and call _net() as usual (some unsqueeze on item_embeddings would be needed for broadcasting)
2. add a BilinearNet.predict_all method that would compute
x = th.LongTensor([7, 8])
y = th.LongTensor([1, 2, 3])
self.user_embeddings(x) @ self.item_embeddings(y).t()
3. use torch.bmm() which, depending on the shape of the user_ids, computes the equivalent of either 1. or 2.
I believe 2. is the cleanest and should also be the fastest.
@maciejkula what do you think?
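For illustration, the all-pairs scoring in the predict_all idea can be sketched with NumPy arrays standing in for the embedding tables. The arrays below are random stand-ins rather than a trained model, bias terms are left out, and predict_all here is a free function written for this sketch, not an existing Spotlight method.

```python
import numpy as np

rng = np.random.RandomState(0)
num_users, num_items, dim = 10, 6, 4
user_embeddings = rng.normal(size=(num_users, dim))  # stand-in embedding table
item_embeddings = rng.normal(size=(num_items, dim))  # stand-in embedding table


def predict_all(user_ids, item_ids):
    """Score every requested user against every requested item."""
    # (n_users, dim) @ (dim, n_items) -> (n_users, n_items)
    return user_embeddings[user_ids] @ item_embeddings[item_ids].T


scores = predict_all(np.array([7, 8]), np.array([1, 2, 3]))
print(scores.shape)  # (2, 3)
```

Row i of the result holds the scores of user_ids[i] for all requested items, so a single matrix product replaces the repeated [7, 7, 7]-style calls.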
Using python 3.6 in conda environment. Getting the following error
conda install -c maciejkula -c soumith spotlight=0.1.2
Fetching package metadata .............
PackageNotFoundError: Package missing in current osx-64 channels:
Just wondering about the technical hurdles for warp loss.
Hello
Just wondering, is there a way to feed batches from text files to spotlight in a pipeline? (because the training data does not fit in RAM).
Thanks
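The file-reading half of this can be sketched independently of Spotlight. The generator below yields fixed-size batches of (user_id, item_id) pairs without loading the whole file. The file format (two whitespace-separated integers per line) and the name iter_batches are assumptions; how each batch is then handed to a model is left open, since the thread does not say whether Spotlight supports it.

```python
import itertools


def iter_batches(path, batch_size):
    """Yield lists of (user_id, item_id) pairs, batch_size pairs at a time."""
    with open(path) as handle:
        # Lazily parse one line at a time; blank lines are skipped.
        pairs = (tuple(int(field) for field in line.split()[:2])
                 for line in handle if line.strip())
        while True:
            batch = list(itertools.islice(pairs, batch_size))
            if not batch:
                return
            yield batch
```

Only one batch is held in memory at a time, so the file can be much larger than RAM.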
I have a few questions about using Spotlight for an item-item problem involving graded implicit feedback. Pardon me if there is a better forum for such questions; I wasn't able to find one.
I work on a system with feedback in the form of clicks (aka page views), likes and purchases.
In this case, obviously, a purchase is substantially more desirable than a simple click.
Is there an obvious way to achieve this with Spotlight? Should I treat it as pure implicit and use the weights parameter to assign a greater weight to purchases than clicks?
Or is it more appropriate to treat it as a ratings prediction problem where the "ratings" are really pseudo-ratings assigned by me?
Also, does Spotlight have any support for cold-start? Or support for predicting for a new user in production based on that user's (previously unseen) history of implicit feedback? Or would lightfm maybe be a better fit for all of this?
Finally, if deployed in production can Spotlight models predict at reasonably low latency? Perhaps <100ms?
Thanks very much for Spotlight. It's well-documented and the code is a joy to read.
How to get similar result to https://github.com/maciejkula/binge as in your paper?
from spotlight.cross_validation import random_train_test_split
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.evaluation import mrr_score
from spotlight.factorization.implicit import ImplicitFactorizationModel
from spotlight.factorization.explicit import ExplicitFactorizationModel
from torch import optim
dataset = get_movielens_dataset(variant='1M')
train, test = random_train_test_split(dataset,test_percentage=0.2)
print train
print test
model = ImplicitFactorizationModel(n_iter=12,embedding_dim=32,batch_size=2048,
learning_rate=0.001,l2=1e-6,loss='adaptive_hinge',use_cuda=True)
model.fit(train,verbose=True)
mrr = mrr_score(model, test)
print mrr.mean()
The result is around 0.035 which is far from 0.07 in the paper. I was using the same hyper-parameters as in "movielens_1M_validation.log" in your binge repository, what am I missing?
Because the representations (i.e. classes that inherit from torch.nn.Module) are instantiated inside of the fit method for the models, it is impossible to pass in an instance of a PyTorch optimizer because one would need to already know the module parameters beforehand.
Perhaps one should pass in a string (e.g. 'sgd') and then that gets mapped to PyTorch optimizers? Or, one can pass in a reference to the optimizer class that then gets instantiated with the module parameters.
Hi
I am using spotlight with the MovieLens dataset to build a simple recommender system. I am finding it hard to understand what some of the functions actually return.
For example:
mrr = mrr_score(model, test)
The documentation says "One score is given for every user with interactions in the test set". What does this score mean? Is it predicting something?
When I print the result, I get
MRR Score
[ 0.04613095 0.01775374 0.02144497 ..., 0.00335534 0.04208428 0.01965703]
Also, what does model.predict(user_id) predict?
Again when I print, I get below results:
User 3
[-21.33734512 22.07717514 12.95882034 ..., -13.99704361 -10.88421345 5.03288126]
Hi @maciejkula,
I have encountered two errors, being unable to download movielens_100k and an out of bounds for int32 error, from example.py of bloom_embeddings.
Complete error reference:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-27-caa8df7efe1c>", line 1, in <module>
dataset = get_movielens_dataset(variant='100K')
File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\movielens.py", line 70, in get_movielens_dataset
return Interactions(*_get_movielens(url))
File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\movielens.py", line 37, in _get_movielens
extension))
File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\_transport.py", line 36, in get_data
download(url, dest_path)
File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\datasets\_transport.py", line 19, in download
req.raise_for_status()
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\models.py", line 935, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://github.com/maciejkula/recommender_datasets/releases/download/v0.2.0%5Cmovielens_100K.hdf5
During my debugging of the link reference I found that this line of code, requests.get(url, stream=True) in _transport.py of the source code, is where the error originates.
The URL passed appears to be correct but somehow requests.get comes back with 404. So I downloaded the file manually and placed it in the destination folder, and then I came across the error below, which is beyond my comprehension :)
C:\ProgramData\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
<Interactions dataset (944 users x 1683 items x 100000 interactions)>
Traceback (most recent call last):
File "C:/BNPP/recoEngine/example.py", line 339, in <module>
random_state=random_state)
File "C:\ProgramData\Anaconda3\lib\site-packages\spotlight\cross_validation.py", line 146, in user_based_train_test_split
seed = random_state.randint(minint, maxint)
File "mtrand.pyx", line 991, in mtrand.RandomState.randint
ValueError: high is out of bounds for int32
Would appreciate your help in smoothly testing out the examples in spotlight.
Thanks,
Pavan
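One plausible explanation for the %5C in the failing URL, not confirmed in this thread, is that the URL was assembled with os.path.join, which uses a backslash on Windows. The sketch below reproduces that effect on any platform using only the standard library (ntpath is the Windows flavour of os.path); the base URL and file name are taken from the traceback above.

```python
import ntpath      # Windows flavour of os.path, importable on any platform
import posixpath
from urllib.parse import quote

base = ('https://github.com/maciejkula/recommender_datasets'
        '/releases/download/v0.2.0')
filename = 'movielens_100K.hdf5'

# Joining like a filesystem path on Windows inserts a backslash ...
windows_style = ntpath.join(base, filename)
# ... which is percent-encoded as %5C once it ends up in a URL.
print(quote(windows_style, safe=':/'))

# Joining with forward slashes gives the expected URL on every platform.
url_style = posixpath.join(base, filename)
print(url_style)
```

If that is indeed the cause, the download would fail only on Windows, which is consistent with the C:\ paths in the traceback.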
I installed the spotlight package using
conda install -c maciejkula -c soumith spotlight=0.1.2
as mentioned in the github link. Then I tested a sample program that worked fine. Just 5 minutes later, I ran the code again and got this error:
ImportError: No module named 'spotlight.datasets'; 'spotlight' is not a package
I am dumbfounded as nothing changed. Any thoughts?
I checked out your hybrid_models branch and noticed you are implementing a context_feature in addition to user_feature and item_feature to handle metadata. Can you please elaborate on what "context_feature" means in your mind?
Also, from your hybrid network, it looks like the metadata embedding is added to the id embedding. Based on your experience, have you tried other ways to combine these embeddings (e.g. concatenate(user_id_emb, user_feature).dot(concatenate(item_id_emb, item_feature)) + bias_terms)?
if self.user is not None:
    user_representation += self.user(user_features)
if self.context is not None:
    user_representation += self.context(context_features)
if self.item is not None:
    item_representation += self.item(item_features)
Thanks
Hi @maciejkula,
Thanks for your awesome work!
I am now using it for my own research and I found there are some small issues:
I think right now the project is not so friendly to custom datasets, as sometimes the user ids do not start from 0, or sometimes they are even unique strings like "AMEVO2LY6VEJA". So I recommend using a dictionary to convert unique user identifiers to ints. It would be nice if this project had a built-in data converter.
For negative sampling here, it cannot be guaranteed that the sampled item is truly "negative". In other words, it can be positive, though the probability is relatively low.
It would be good if spotlight had more ranking metrics like "Precision", "Recall", "AUC" and "MAP".
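Until a built-in converter exists, the dictionary approach from the first point takes only a few lines. This is a plain-Python sketch; build_id_map is a hypothetical helper, and the identifiers "B00X" and "C42" are made up for the example.

```python
def build_id_map(raw_ids, start=0):
    """Map arbitrary hashable identifiers to contiguous integers.

    Ids are assigned in first-seen order, beginning at start.
    """
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping) + start
    return mapping


raw_users = ['AMEVO2LY6VEJA', 'B00X', 'AMEVO2LY6VEJA', 'C42']
user_map = build_id_map(raw_users)
encoded = [user_map[u] for u in raw_users]
print(encoded)  # [0, 1, 0, 2]
```

Passing start=1 leaves 0 free, which can be handy if 0 is reserved, for example as a padding value. Keeping the mapping around also makes it possible to translate predictions back to the original identifiers.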
Hi, do you think it would be worth trying to transform the current evaluation code to compute scores in mini-batches, instead of one user at a time as is done currently?
Note that this seems to be blocked by #92
Hey,
I am using Spotlight for a school project. Is there any way I can train the model on some data and save it, and use it later for prediction?
Thanks!
Hi @maciejkula !
Great library! This is some seriously awesome work!
I was wondering how to add additional features that are not stationary, such as context data that applies to the whole sequence.
Can you please tell me how to do this with the Spotlight library, or is this not supported yet?
Thanks!
Currently trying, with the latest master branch, to downsample the goodreads dataset by user:
import numpy as np
from spotlight.cross_validation import random_train_test_split, user_based_train_test_split
from spotlight.datasets.goodbooks import get_goodbooks_dataset
dataset = get_goodbooks_dataset()
dataset.ratings = dataset.ratings.astype(np.float32)
train, test = user_based_train_test_split(dataset, test_percentage=0.7)
print(train)
Train is: <Interactions dataset (53425 users x 10001 items x 1793373 interactions)>
The full dataset is: <Interactions dataset (53425 users x 10001 items x 5976479 interactions)>
i.e. the split only downsampled interactions, not users.
Thanks guys for developing (and open-sourcing) a PyTorch-based recommender engine. We are interested in leveraging it in one of our projects.
While trying to install the package using conda on Windows, I am getting PackageNotFoundError: Packages missing in current channels. It seems the selected conda channels (maciejkula, pytorch) do not have the required packages (spotlight 0.1.4).
Has the channel changed? Do you guys support pip-based install? If yes, I would request adding pip-based installation steps to GitHub's landing page (README.md).
Full trace:
(py3.6) C:\Users\Abhijeets>conda install -c maciejkula -c pytorch spotlight=0.1.4
Fetching package metadata .................
Solving package specifications:
PackageNotFoundError: Packages missing in current channels:
- spotlight 0.1.4* -> pytorch 0.3.1.*
We have searched for the packages in the following channels:
- https://conda.anaconda.org/maciejkula/win-64
- https://conda.anaconda.org/maciejkula/noarch
- https://conda.anaconda.org/pytorch/win-64
- https://conda.anaconda.org/pytorch/noarch
- https://repo.continuum.io/pkgs/main/win-64
- https://repo.continuum.io/pkgs/main/noarch
- https://repo.continuum.io/pkgs/free/win-64
- https://repo.continuum.io/pkgs/free/noarch
- https://repo.continuum.io/pkgs/r/win-64
- https://repo.continuum.io/pkgs/r/noarch
- https://repo.continuum.io/pkgs/pro/win-64
- https://repo.continuum.io/pkgs/pro/noarch
- https://repo.continuum.io/pkgs/msys2/win-64
- https://repo.continuum.io/pkgs/msys2/noarch
Hi,
I manually downloaded your stable 0.1.1 but with the example code from here I get this error.
HTTPError: 404 Client Error: Not Found for url: https://github.com/maciejkula/recommender_datasets/releases/download/0.1.0%5Cmovielens_100K.hdf5
Seems that the whole "releases" folder isn't present anymore.
Thanks for your help !
I am trying spotlight and noticed two things; I am not sure whether they are expected behavior or not.
Use case:
A simplistic example is as follows:
user_ids : [1,2,3..............................]
item_ids = [1,2,3,......................]
interactions can look like this (userid, itemid):
(1,5), (2,3), (1,1), (9,2), etc etc
Initially, I train the model on some historical data of user-item interactions. After that, each time a user interacts with an item, I update the model by calling the fit method.
Case 1: Calling fit repeatedly for ImplicitSequenceModel
model.fit(sequences)
Here I get the following error:
Traceback (most recent call last):
File "spotlight_rec.py", line 110, in <module>
main()
File "spotlight_rec.py", line 80, in main
rt_counter = recommender.train(train_set)
File "spotlight_rec.py", line 38, in train
self.model.fit(self.sequences)
File "/opt/conda/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 215, in fit
self._check_input(sequences)
File "/opt/conda/lib/python3.6/site-packages/spotlight/sequence/implicit.py", line 192, in _check_input
raise ValueError('Maximum item id greater '
ValueError: Maximum item id greater than number of items in model.
The error happens exactly at this point:
def _check_input(self, item_ids):
    if isinstance(item_ids, int):
        item_id_max = item_ids
    else:
        item_id_max = item_ids.max()
    if item_id_max >= self._num_items:
        raise ValueError('Maximum item id greater '
                         'than number of items in model.')
This condition (if item_id_max >= self._num_items) can be true for us because a user could interact with an item which has an id number higher than the number of items.
Our current workaround: reinitialize the model when we encounter this error.
Main question: Is this expected behavior, and how can we handle this situation?
Case 2: Calling fit repeatedly or just one time for ImplicitFactorizationModel
In both these cases, I get an error when len(item_ids) % mini-batch size == 1. For example, if len(item_ids) is 17 and the mini-batch size is 8, then 17 % 8 == 1, leading to a value error. In my case, the mini-batch size is 128. It happens at the last iteration, when creating the user and item embeddings in the forward method of the BilinearNet class.
Value Error:
Traceback (most recent call last):
File "src/spotlight_recommender.py", line 136, in <module>
main()
File "src/spotlight_recommender.py", line 114, in main
recommender.train(train_set)
File "src/spotlight_recommender.py", line 38, in train
self.model.fit(data)
File "/home/user/anaconda3/lib/python3.5/site-packages/spotlight/factorization/implicit.py", line 233, in fit
positive_prediction = self._net(user_var, item_var)
File "/home/user/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/anaconda3/lib/python3.5/site-packages/spotlight/factorization/representations.py", line 92, in forward
dot = (user_embedding * item_embedding).sum(1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
The error occurs in the following code:
def forward(self, user_ids, item_ids):
    """
    Compute the forward pass of the representation.

    Parameters
    ----------
    user_ids: tensor
        Tensor of user indices.
    item_ids: tensor
        Tensor of item indices.

    Returns
    -------
    predictions: tensor
        Tensor of predictions.
    """
    user_embedding = self.user_embeddings(user_ids)
    item_embedding = self.item_embeddings(item_ids)
    user_embedding = user_embedding.squeeze()
    item_embedding = item_embedding.squeeze()
    user_bias = self.user_biases(user_ids).squeeze()
    item_bias = self.item_biases(item_ids).squeeze()
    dot = (user_embedding * item_embedding).sum(1)
    return dot + user_bias + item_bias
The error occurs when len(item_ids) is 1. It works for all other values. After some logging, the batch sizes go like this: 128, 128, 128, ..., 1, and the last one causes the error. When len(item_ids) is 1, I also notice that item_embedding looks different.
Current workaround: we just skip this iteration and retrain in the next one.
Main question: Are we doing something wrong, or is this expected behavior? Why can't this situation be handled within this method?
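The traceback is consistent with a bare squeeze() removing the batch axis when the batch has a single row, so that sum(1) no longer has an axis 1 to reduce. The sketch below reproduces the shape collapse with NumPy arrays as a stand-in for the tensors. The (batch, 1, dim) input shape and the reshape-based alternative are assumptions for illustration, not the library's actual fix.

```python
import numpy as np

EMBEDDING_DIM = 32


def dot_with_squeeze(user_emb, item_emb):
    # Mirrors the failing pattern: squeeze() drops every size-1 axis,
    # including the batch axis when the batch holds one row.
    u = user_emb.squeeze()
    i = item_emb.squeeze()
    return (u * i).sum(1)


def dot_with_reshape(user_emb, item_emb):
    # Keeps an explicit (batch, dim) layout regardless of batch size.
    u = user_emb.reshape(-1, EMBEDDING_DIM)
    i = item_emb.reshape(-1, EMBEDDING_DIM)
    return (u * i).sum(1)


full = np.ones((128, 1, EMBEDDING_DIM))
single = np.ones((1, 1, EMBEDDING_DIM))

print(dot_with_squeeze(full, full).shape)      # batch of 128 works
print(dot_with_reshape(single, single).shape)  # batch of 1 works too

try:
    dot_with_squeeze(single, single)
    squeeze_failed = False
except ValueError:  # numpy's AxisError is a ValueError subclass
    squeeze_failed = True
print(squeeze_failed)
```

Until the library handles this case, choosing a batch size that does not leave a remainder of 1 would sidestep the problem.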
Extra:
Normal item embedding when the size is 128:
[torch.FloatTensor of size 128x32]
<class 'torch.autograd.variable.Variable'>
Variable containing:
-0.0585 0.1001 0.0464 ... 0.0846 0.1175 -0.1926
-0.0459 0.0973 -0.1781 ... 0.0870 0.0962 -0.1985
0.1101 0.0706 -0.0717 ... 0.1948 -0.0446 -0.2401
... ⋱ ...
0.1779 0.1666 0.1758 ... 0.1664 0.0789 -0.0398
0.2808 0.2151 0.0562 ... 0.0679 0.0388 0.0283
-0.6652 0.6864 -0.8043 ... -0.6818 -0.3306 0.4183
Item embedding in the last round, when the size is 1:
[torch.FloatTensor of size 128x32]
<class 'torch.autograd.variable.Variable'>
Variable containing:
0.4371
0.5031
-0.1965
0.5416
0.2742
-0.4299
0.4275
0.5921
-0.5964
-0.3495
-0.3941
0.4900
0.5652
-0.4823
0.4789
0.4647
-0.4036
0.3759
-0.4495
0.1618
0.4748
-0.5056
-0.6162
-0.5277
-0.4750
0.4905
-0.3921
0.4476
0.4825
-0.4493
0.5846
-0.5344
When I try to import some modules:
from spotlight.datasets.movielens import get_movielens_dataset
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'spotlight.datasets'
from spotlight.factorization.explicit import ExplicitFactorizationModel
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'spotlight.factorization'
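One possible cause, and it is only an assumption here, is that Python is picking up something else named spotlight (a local file or directory, or a different installed package) instead of this library. A small diagnostic sketch that uses only the standard library to show where each module would be loaded from (locate is a hypothetical helper name):

```python
import importlib.util


def locate(module_name):
    """Return the file a module would be loaded from, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent is missing or is a plain module, not a package.
        return None
    return None if spec is None else spec.origin


for name in ('spotlight', 'spotlight.datasets', 'spotlight.factorization'):
    print(name, '->', locate(name))
```

If the first line prints a path outside site-packages, or the submodules print None while spotlight itself resolves, the imported package is probably not the one that was meant to be installed.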
Hi,
Is there a way to incorporate user metadata (e.g., age, location) and item metadata (e.g., tags) in spotlight?
Hi, I have a dataset in CSV format (https://raw.githubusercontent.com/juanremi/datasets-to-CF-recsys/master/bigquery-frutas-sin-items-no-vistos-ids.csv); the first row contains the headers (user, item, rating).
So I modified the file https://raw.githubusercontent.com/maciejkula/spotlight/master/spotlight/datasets/movielens.py to load my own dataset:
import os
import torch
import numpy as np
from spotlight.datasets import _transport
from spotlight.interactions import Interactions


def _get_my_own(dataset):
    extension = '.csv'
    URL_PREFIX = '/home/user/htdocs/pruebas/datasets/myheaders/'
    data = np.genfromtxt(URL_PREFIX + dataset + extension, delimiter=',', names=True, dtype=(int, int, float))
    usuarios = data['user']
    items = data['item']
    ratings = data['rating']
    return (usuarios, items, ratings)


def get_my_own_dataset(myfile):
    """
    Returns
    -------
    Interactions: :class:`spotlight.interactions.Interactions`
        instance of the interactions class
    """
    url = myfile
    return Interactions(*_get_my_own(url))


dataset = get_my_own_dataset('bigquery-frutas-sin-items-no-vistos-ids')
print(dataset)

from spotlight.factorization.explicit import ExplicitFactorizationModel

model = ExplicitFactorizationModel(loss='regression',
                                   embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   l2=1e-9,  # strength of L2 regularization
                                   learning_rate=1e-3,
                                   use_cuda=torch.cuda.is_available())

from spotlight.cross_validation import random_train_test_split

train, test = random_train_test_split(dataset, random_state=np.random.RandomState(42))
print('Split into \n {} and \n {}.'.format(train, test))
model.fit(train, verbose=True)
But when I run it, the output is:
<Interactions dataset (14 users x 17 items x 139 interactions)>
Split into
<Interactions dataset (14 users x 17 items x 111 interactions)> and
<Interactions dataset (14 users x 17 items x 28 interactions)>.
Traceback (most recent call last):
File "condadata.py", line 52, in <module>
model.fit(train, verbose=True)
File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/spotlight/factorization/explicit.py", line 172, in fit
loss = loss_fnc(ratings_var, predictions)
File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/spotlight/losses.py", line 191, in regression_loss
return ((observed_ratings - predicted_ratings) ** 2).mean()
File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/variable.py", line 752, in __sub__
return self.sub(other)
File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/variable.py", line 296, in sub
return self._sub(other, False)
File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/variable.py", line 290, in _sub
return Sub(inplace=inplace)(self, other)
File "/var/www/miniconda2/envs/conda1/lib/python2.7/site-packages/torch/autograd/_functions/basic_ops.py", line 34, in forward
return a.sub(b)
TypeError: sub received an invalid combination of arguments - got (torch.FloatTensor), but expected one of:
* (float value)
didn't match because some of the arguments have invalid types: (torch.FloatTensor)
* (torch.DoubleTensor other)
didn't match because some of the arguments have invalid types: (torch.FloatTensor)
* (float value, torch.DoubleTensor other)
Any idea?
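The TypeError shows a DoubleTensor of ratings being subtracted against a FloatTensor of predictions. That matches np.genfromtxt being given Python's float, which NumPy reads as float64. A minimal sketch of casting the columns before building the dataset follows; the in-memory CSV rows are made up for the example, and the target dtypes (int32 ids, float32 ratings) are an assumption about what the model expects.

```python
import io

import numpy as np

# Tiny in-memory stand-in for the CSV file (hypothetical rows).
csv_text = "user,item,rating\n1,2,3.5\n2,3,4.0\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=(int, int, float))

# Python's float maps to float64, which PyTorch treats as a double.
print(data['rating'].dtype)

# Cast explicitly so the ratings match float32 predictions.
user_ids = data['user'].astype(np.int32)
item_ids = data['item'].astype(np.int32)
ratings = data['rating'].astype(np.float32)
print(user_ids.dtype, item_ids.dtype, ratings.dtype)
```

The cast arrays can then be passed to Interactions(user_ids, item_ids, ratings) in place of the raw columns.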
When running the test suite of the master branch on Windows 10, multiple tests currently fail:
pytest -sv tests/
[...]
=============== 34 failed, 24 passed, 3 error in 24.89 seconds ================
full output can be found here.
The two most frequent errors are the following. The first is in randint:
..\spotlight\cross_validation.py:146: in user_based_train_test_split
seed = random_state.randint(minint, maxint)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: high is out of bounds for int32
mtrand.pyx:975: ValueError
The second is in torch.index_select:
def _get_hashed_indices(self, original_indices):
# [...]
hashed_indices = torch.index_select(self._hashes,
0,
> original_indices.squeeze())
E TypeError: torch.index_select received an invalid combination of arguments - got (torch.LongTensor, int, !torch.IntTensor!), but expected (torch.LongTensor source, int dim, torch.LongTensor index)
..\spotlight\layers.py:204: TypeError
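Both failures look consistent with the default integer type being 32-bit on Windows for the NumPy version listed below, where the default followed the C long type. A hedged sketch of the two usual remedies, requesting int64 explicitly when drawing a seed and casting index arrays to int64 before they become tensors, is shown here. The helper names and the uint32 upper bound are assumptions, not the values used in spotlight.

```python
import numpy as np


def draw_seed(random_state):
    # Ask for int64 explicitly so bounds beyond the int32 range are
    # accepted even where the platform default integer is 32-bit.
    return random_state.randint(0, np.iinfo(np.uint32).max, dtype=np.int64)


def as_long_indices(indices):
    # int64 arrays convert to LongTensor, which index_select expects.
    return np.asarray(indices).astype(np.int64)


seed = draw_seed(np.random.RandomState(42))
indices = as_long_indices(np.array([3, 1, 2], dtype=np.int32))
print(seed, indices.dtype)
```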
Run on the master branch with the following installed dependencies:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import torch; print("PyTorch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
import spotlight; print("spotlight", spotlight.__version__)
Windows-10-10.0.14393-SP0
Python 3.6.3 |Anaconda, Inc.| (default, Nov 8 2017, 15:10:56) [MSC v.1900 64 bit (AMD64)]
NumPy 1.12.1
SciPy 1.0.0
Scikit-Learn 0.19.1
PyTorch 0.3.0b0+591e73e
CUDA available: True
CUDA version: 8.0
spotlight v0.1.3
PyTorch for Windows was installed from the peterjc123 conda channel, as described in pytorch/pytorch#494 (comment).
When installed on Linux with the same approach (though without GPU support), all tests pass.
I'm trying to implement both recommendations and item-item similarities based on an ImplicitFactorizationModel trained on clickstream data. The former is trivial using the predict method, but the latter I'm not quite sure about. Looking at the code, I could obtain the item embeddings in a similar way to how it's done in _components.py and representations.py, and then apply a similarity measure to the resulting torch tensors.
Is this a sensible approach / Is there a better way to do this / Is it even possible?
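As a rough illustration of the second half of that idea, the sketch below computes cosine similarities between item embeddings with NumPy. It assumes the embedding weights have already been copied out of the trained model into an array of shape (num_items, embedding_dim); how to extract them is not shown and would depend on the model internals. The function names and the toy matrix are made up for the example.

```python
import numpy as np


def cosine_similarity_matrix(embeddings, eps=1e-12):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.maximum(norms, eps)  # avoid dividing by zero
    return normalized @ normalized.T


def most_similar(embeddings, item_id, k=5):
    """Return the ids and scores of the k items closest to `item_id`."""
    sims = cosine_similarity_matrix(embeddings)[item_id].copy()
    sims[item_id] = -np.inf  # exclude the query item itself
    top = np.argsort(-sims)[:k]
    return top, sims[top]


# Toy stand-in for a trained item embedding matrix (4 items, 2 dimensions).
toy_embeddings = np.array([[1.0, 0.0],
                           [2.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
neighbours, scores = most_similar(toy_embeddings, item_id=0, k=2)
print(neighbours, scores)
```

For large catalogues, computing the full similarity matrix may be too expensive; scoring one query row against the normalized matrix is usually enough.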
Is anyone interested in having the same feature set available in TF?
I'm considering implementing at least implicit matrix factorization in TF, want to know if it's worth making PRs or not.
Hi! Thank you for Spotlight.
I have a problem similar to issue #80.
My OS is Windows 10 x64.
First, I must admit I am new to Anaconda. I have a 'portable' installation of Anaconda (installed with Anaconda3-5.1.0-Windows-x86_64.exe).
When I try:
conda install -c maciejkula -c pytorch spotlight=0.1.4
I get:
u:\Python\Anaconda3>conda install -c maciejkula -c pytorch spotlight=0.1.4
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
- spotlight=0.1.4
- pytorch=0.3.1
Current channels:
- https://conda.anaconda.org/maciejkula/win-64
- https://conda.anaconda.org/maciejkula/noarch
- https://conda.anaconda.org/pytorch/win-64
- https://conda.anaconda.org/pytorch/noarch
- https://repo.continuum.io/pkgs/main/win-64
- https://repo.continuum.io/pkgs/main/noarch
- https://repo.continuum.io/pkgs/free/win-64
- https://repo.continuum.io/pkgs/free/noarch
- https://repo.continuum.io/pkgs/r/win-64
- https://repo.continuum.io/pkgs/r/noarch
- https://repo.continuum.io/pkgs/pro/win-64
- https://repo.continuum.io/pkgs/pro/noarch
- https://repo.continuum.io/pkgs/msys2/win-64
- https://repo.continuum.io/pkgs/msys2/noarch
u:\Python\Anaconda3>
I have not created an environment... Is it necessary?
Or is it perhaps necessary to add more channels?
I have created a py 3.6 env.
Please let me know how to install it.
conda install -c maciejkula -c pytorch spotlight=0.1.4 is not working
conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.4 even this fails
At the moment it goes through multiple calls when all predictions could be computed in one go.
It could be useful to implement diversification metrics like Coverage, Novelty, Diversity, and Serendipity besides the accuracy metrics.
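As an indication of what the simplest of these could look like, here is a sketch of catalog coverage, taken here to mean the fraction of all items that appear in at least one user's recommendation list. The function name and input format are assumptions for illustration, not an existing spotlight API.

```python
def catalog_coverage(recommendations, num_items):
    """Fraction of the catalog that appears in at least one recommendation list.

    `recommendations` is an iterable of per-user lists of recommended item ids.
    """
    recommended = set()
    for items in recommendations:
        recommended.update(items)
    return len(recommended) / num_items


# Two users, top-2 lists, catalog of 10 items: items 1, 2 and 3 are covered.
coverage = catalog_coverage([[1, 2], [2, 3]], num_items=10)
print(coverage)
```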
At the moment, some operations are inefficient due to excessive allocation of new tensors. We could keep tensors around and re-use them for each minibatch, or use in-place operations to reduce allocation needs.
Does the "weights" parameter of the Interactions class mean the weight for a specific interaction?
It seems to me that the parameter is not used anywhere.