
fneighcf

Python implementation of the collaborative filtering algorithm for explicit feedback data described in Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1.

The paper (Factor in the neighbors: Scalable and accurate collaborative filtering) can be downloaded here: http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf

Model description

The model consists of predicting the rating a user would give to an item by parameterizing how much the deviation from the baseline in each item the user has rated translates into a higher or lower rating for the item being predicted. The basic formula is as follows:

rating_ui = global_bias + user_bias_u + item_bias_i + sum_j( (rating_uj - bias_uj)*w_ij ) + sum_j( c_ij )

(Where u is a given user, i is the item to predict, and j are all the items rated by the user. More details can be found in the paper above and in the IPython notebook example below).

The rating deviations are calculated from fixed user and item biases, which are calculated by a simple formula from the data, while the model also includes variable user and item biases as part of its parameters (see formula above).
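As a rough illustration of the prediction rule above, here is a minimal NumPy sketch. This is not the package's internal code; all names (`predict_rating`, `W`, `C`, etc.) are made up for the example:

```python
import numpy as np

def predict_rating(global_bias, user_bias, item_bias,
                   rated_items, user_ratings, item_baselines,
                   W, C, i):
    """Sketch of the prediction rule for one user and target item i.
    rated_items    : indices j of the items rated by the user
    user_ratings   : the user's ratings rating_uj for those items
    item_baselines : baseline estimates bias_uj for those items
    W, C           : square item-item parameter matrices
    """
    dev = user_ratings - item_baselines        # (rating_uj - bias_uj)
    pred = global_bias + user_bias + item_bias
    pred += np.dot(dev, W[rated_items, i])     # explicit-feedback term
    pred += np.sum(C[rated_items, i])          # implicit-feedback term
    return pred
```

The paper additionally normalizes the two sums by the number of rated items; that detail is left out here for clarity.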

While rating predictions from this model usually don't achieve as low RMSE as those generated by matrix factorization models, they allow for immediate update of recommendations as soon as the user submits feedback for more items without updating the model itself, and can also start recommending to new users as soon as they rate items.

Making recommendations from this parameterized model on larger datasets is much faster than user-user or item-item similarity or weighted nearest neighbors (and of better quality), albeit not as fast as low-rank matrix factorization models.

Note that, if the user bias is eliminated from the formula, the problem becomes a series of separate linear regressions (one per item), each predicting an item's centered ratings using as covariates the centered ratings of every other item plus indicator columns for whether each other item was rated (each user who rated a given movie is one observation for the model that predicts that movie). This almost-equivalent formulation is massively parallelizable and can be fit with standard packages such as scikit-learn.
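That per-item regression reformulation might be sketched as follows with scikit-learn. This is a toy illustration under the stated assumptions (no user bias, Ridge as the regularized regression), not part of the package:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy centered ratings matrix: rows = users, cols = items, NaN = not rated
R = np.array([[ 0.5,  1.0, np.nan],
              [-0.5, np.nan, 0.8],
              [ 0.2, -1.0,  0.3]])

def fit_item_model(R, i, alpha=1.0):
    """Fit one regularized linear regression predicting item i's centered
    ratings from the other items' centered ratings plus indicator columns
    for whether each other item was rated."""
    mask = ~np.isnan(R[:, i])                    # users who rated item i
    others = np.delete(R, i, axis=1)[mask]       # their other ratings
    indicators = (~np.isnan(others)).astype(float)
    covariates = np.hstack([np.nan_to_num(others), indicators])
    model = Ridge(alpha=alpha, fit_intercept=False)
    model.fit(covariates, R[mask, i])
    return model

# One independent model per item -- trivially parallelizable
models = [fit_item_model(R, i) for i in range(R.shape[1])]
```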

Installation

The package is available on PyPI and can be installed with:

pip install fneighcf

As it contains Cython code, it requires a C compiler. On Windows, this usually means a Visual Studio installation (or MinGW + gcc); if using Anaconda, it might also require configuring it to use Visual Studio instead of MinGW, otherwise the pip installation might fail. For more details see this guide: Cython Extensions On Windows

On Python 2.7 on Windows, it might additionally require installing extra Visual Basic modules.

On Linux and Mac, the pip install should work out-of-the-box, as long as the system has gcc (included by default in most installs).

Usage

import pandas as pd
from fneighcf import FNeigh

## creating fake movie ratings
ratings = pd.DataFrame({
    'Rating':[3,4,5,6,6],
    'UserId':[0,0,0,1,1],
    'ItemId':['a','b','c','b','c']
})

## initializing the model
rec = FNeigh(center_ratings=True, alpha=0.5, lambda_bu=10, lambda_bi=25,
             lambda_u=5e-1, lambda_i=5e-2, lambda_W=5e-3, lambda_C=5e-2)

## fitting it to the data (use either L-BFGS or SGD, not both)
rec.fit(ratings, method='lbfgs', opts_lbfgs={'maxiter':300}, verbose=False)
rec.fit(ratings, method='sgd', epochs=100, step_size=5e-3, decrease_step=True,
        early_stop=True, random_seed=None, verbose=False)

## making predictions on test data
rec.predict(uids=[0,1], iids=['a','a'])

## getting top-N recommendations for a user
rec.topN(n=2, uid=0)
rec.topN(n=2, ratings=[4,3], items=['a','b'])

A more detailed example with the MovieLens data can be found in this IPython notebook, and docstrings are available for internal documentation (e.g. help(FNeigh), help(FNeigh.fit)).

Implementation Notes

The package implements only the full model with all item-item effects considered, and with the full square item-item matrices rather than the reduced latent-factor versions, so its memory requirement is quadratic in the number of items. It doesn't implement the user-user models.

All users and items are reindexed internally at fitting time, so you can pass any hashable object (numbers, strings, characters, etc.) as IDs.

The core computations are implemented in fast Cython code. The model can be fit using L-BFGS-B, which calculates the full gradient and objective function in order to update the parameters at each iteration, or using SGD (Stochastic Gradient Descent) as described in the paper (see references at the bottom), which iterates through the data updating the parameters right after processing each individual rating.

SGD has lower memory requirements and can reach a low error in less time, but it likely won't reach the exact optimum and requires tuning the step size and other parameters. This makes it the more appropriate alternative for large datasets.
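As a sketch of what one SGD step per rating looks like, here is a toy update rule following the general form described in the paper. Names are illustrative (not the package's internals), and the per-user normalization of the item-item terms is omitted for brevity:

```python
import numpy as np

def sgd_update(r_ui, pred, b_u, b_i, w_i, c_i, dev_j, step, lam):
    """One SGD step for a single observed rating r_ui.
    pred  : the model's current prediction for (u, i)
    b_u, b_i : variable user and item biases
    w_i, c_i : rows of the W and C matrices for the user's rated items j
    dev_j : the deviations (rating_uj - bias_uj) for those items
    """
    err = r_ui - pred
    b_u += step * (err - lam * b_u)           # user bias update
    b_i += step * (err - lam * b_i)           # item bias update
    w_i += step * (err * dev_j - lam * w_i)   # explicit weights w_ij
    c_i += step * (err - lam * c_i)           # implicit terms c_ij
    return b_u, b_i, w_i, c_i
```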

L-BFGS-B requires more memory and is slower to converge, but it reliably reaches the optimal solution and doesn't require tuning parameters such as the step size; it is recommended for smaller datasets. It can use multithreading for calculating the gradient and objective function, but the L-BFGS implementation itself is single-threaded, so the speed-up is limited.

The package has only been tested under Python 3.

References

  • Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1.
