
fneighcf

Python implementation of the collaborative filtering algorithm for explicit feedback data described in Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1.

The paper (Factor in the neighbors: Scalable and accurate collaborative filtering) can be downloaded here: http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf

Model description

The model consists of predicting the rating a user would give to an item by parameterizing how much the deviation from the baseline in each item the user has rated translates into a higher or lower rating for the item being predicted. The basic formula is as follows:

rating_ui = global_bias + user_bias_u + item_bias_i + sum_j( (rating_uj - bias_uj)*w_ij ) + sum_j( c_ij )

(Where u is a given user, i is the item to predict, and j are all the items rated by the user. More details can be found in the paper above and in the IPython notebook example below).

The rating deviations are calculated from fixed user and item biases, which are calculated by a simple formula from the data, while the model also includes variable user and item biases as part of its parameters (see formula above).
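As a rough illustration of the prediction rule above, here is a minimal NumPy sketch. This is not the package's internal code; all names (`predict_rating`, `W`, `C`, etc.) are made up for the example:

```python
import numpy as np

def predict_rating(global_bias, user_bias, item_bias,
                   rated_items, user_ratings, item_baselines,
                   W, C, i):
    """Sketch of the prediction rule for one user and target item i.
    rated_items    : indices j of the items rated by the user
    user_ratings   : the user's ratings rating_uj for those items
    item_baselines : baseline estimates bias_uj for those items
    W, C           : square item-item parameter matrices
    """
    dev = user_ratings - item_baselines        # (rating_uj - bias_uj)
    pred = global_bias + user_bias + item_bias
    pred += np.dot(dev, W[rated_items, i])     # explicit-feedback term
    pred += np.sum(C[rated_items, i])          # implicit-feedback term
    return pred
```

The paper additionally normalizes the two sums by the number of rated items; that detail is left out here for clarity.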

While rating predictions from this model usually don't achieve as low RMSE as those generated by matrix factorization models, they allow for immediate update of recommendations as soon as the user submits feedback for more items without updating the model itself, and can also start recommending to new users as soon as they rate items.

Making recommendations from this parameterized model on larger datasets is much faster than user-user or item-item similarity or weighted nearest neighbors (and of better quality), albeit not as fast as low-rank matrix factorization models.

Note that, if the user bias is eliminated from the formula, the problem becomes a series of separate linear regressions (one per item), each predicting an item's centered ratings using as covariates the centered ratings of every other item plus indicator columns for whether each other item was rated (each user who rated a given movie is one observation for the model that predicts that movie). This almost-equivalent formulation is massively parallelizable and can be fit with standard packages such as scikit-learn.
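That per-item regression reformulation might be sketched as follows with scikit-learn. This is a toy illustration under the stated assumptions (no user bias, Ridge as the regularized regression), not part of the package:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy centered ratings matrix: rows = users, cols = items, NaN = not rated
R = np.array([[ 0.5,  1.0, np.nan],
              [-0.5, np.nan, 0.8],
              [ 0.2, -1.0,  0.3]])

def fit_item_model(R, i, alpha=1.0):
    """Fit one regularized linear regression predicting item i's centered
    ratings from the other items' centered ratings plus indicator columns
    for whether each other item was rated."""
    mask = ~np.isnan(R[:, i])                    # users who rated item i
    others = np.delete(R, i, axis=1)[mask]       # their other ratings
    indicators = (~np.isnan(others)).astype(float)
    covariates = np.hstack([np.nan_to_num(others), indicators])
    model = Ridge(alpha=alpha, fit_intercept=False)
    model.fit(covariates, R[mask, i])
    return model

# One independent model per item -- trivially parallelizable
models = [fit_item_model(R, i) for i in range(R.shape[1])]
```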

Installation

The package is available on PyPI and can be installed with:

pip install fneighcf

As it contains Cython code, it requires a C compiler. On Windows, this usually means a Visual Studio installation (or MinGW + gcc); if using Anaconda, it might also require configuring it to use Visual Studio instead of MinGW, otherwise the pip installation might fail. For more details see this guide: Cython Extensions On Windows

On Python 2.7 on Windows, it might additionally require installing extra Visual Basic modules.

On Linux and Mac, the pip install should work out-of-the-box, as long as the system has gcc (included by default in most installs).

Usage

import pandas as pd
from fneighcf import FNeigh

## creating fake movie ratings
ratings = pd.DataFrame({
    'Rating':[3,4,5,6,6],
    'UserId':[0,0,0,1,1],
    'ItemId':['a','b','c','b','c']
})

## initializing the model
rec = FNeigh(center_ratings=True, alpha=0.5, lambda_bu=10, lambda_bi=25,
             lambda_u=5e-1, lambda_i=5e-2, lambda_W=5e-3, lambda_C=5e-2)

## fitting it to the data (use either L-BFGS or SGD, not both)
rec.fit(ratings, method='lbfgs', opts_lbfgs={'maxiter':300}, verbose=False)
rec.fit(ratings, method='sgd', epochs=100, step_size=5e-3, decrease_step=True,
        early_stop=True, random_seed=None, verbose=False)

## making predictions on test data
rec.predict(uids=[0,1], iids=['a','a'])

## getting top-N recommendations for a user
rec.topN(n=2, uid=0)
rec.topN(n=2, ratings=[4,3], items=['a','b'])

A more detailed example with the MovieLens data can be found in this IPython notebook, and docstrings are available for internal documentation (e.g. help(FNeigh), help(FNeigh.fit)).

Implementation Notes

The package implements only the full model with all item-item effects considered, and with the full square item-item matrices rather than the reduced latent-factor versions, so its memory requirement is quadratic in the number of items. It doesn't implement the user-user models.

All users and items are reindexed internally at fitting time, so you can pass any hashable object (numbers, strings, characters, etc.) as IDs.

The core computations are implemented in fast Cython code. The model can be fit using L-BFGS-B, which calculates the full gradient and objective function in order to update the parameters at each iteration, or using SGD (Stochastic Gradient Descent) as described in the paper (see references at the bottom), which iterates through the data updating the parameters right after processing each individual rating.

SGD has lower memory requirements and can reach a low error in less time, but it likely won't reach the exact optimum and requires tuning the step size and other parameters. This makes it the more appropriate alternative for large datasets.
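As a sketch of what one SGD step per rating looks like, here is a toy update rule following the general form described in the paper. Names are illustrative (not the package's internals), and the per-user normalization of the item-item terms is omitted for brevity:

```python
import numpy as np

def sgd_update(r_ui, pred, b_u, b_i, w_i, c_i, dev_j, step, lam):
    """One SGD step for a single observed rating r_ui.
    pred  : the model's current prediction for (u, i)
    b_u, b_i : variable user and item biases
    w_i, c_i : rows of the W and C matrices for the user's rated items j
    dev_j : the deviations (rating_uj - bias_uj) for those items
    """
    err = r_ui - pred
    b_u += step * (err - lam * b_u)           # user bias update
    b_i += step * (err - lam * b_i)           # item bias update
    w_i += step * (err * dev_j - lam * w_i)   # explicit weights w_ij
    c_i += step * (err - lam * c_i)           # implicit terms c_ij
    return b_u, b_i, w_i, c_i
```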

L-BFGS-B requires more memory and is slower to converge, but it reliably reaches the optimal solution and doesn't require tuning parameters such as the step size; it is recommended for smaller datasets. It can use multithreading for calculating the gradient and objective function, but the L-BFGS implementation itself is single-threaded, so the speed-up is limited.

The package has only been tested under Python 3.

References

  • Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1.
